Understand RLHF: Reinforcement Learning from Human Feedback
Walk a single example through SFT, a reward model, and one PPO update so the RLHF loop stops feeling mythical. By the end, you'll sketch a preference-data pipeline for a real prompt in your own product.
Phase 1: Why pretrained models are smart but uncooperative
See why pretrained models are smart but uncooperative
- Pretrained models are smart but uncooperative (6 min)
- Supervised fine-tuning hits a wall fast (6 min)
- RLHF is a three-stage relay, not one trick (7 min)
- Comparisons beat ratings, and the math says why (6 min; see the sketch after this list)
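As a rough illustration of the comparison math (not taken from the lessons themselves; the scores below are made up), a reward model trained on comparisons only ever needs the difference between two scores for the same prompt:

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log(sigmoid(r_chosen - r_rejected)).

    Only the *difference* between the two scores matters, which is why
    relative comparisons are enough and absolute ratings are never needed.
    """
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Made-up scores for two answers to the same prompt.
print(preference_loss(1.8, 0.3))  # ~0.20: the model already prefers the chosen answer
print(preference_loss(0.3, 1.8))  # ~1.70: the model prefers the wrong answer, so the loss is large
```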
Phase 2: Tracing one example through the RLHF loop
Trace one example through SFT, reward model, and PPO
- Walk one prompt through SFT and watch the loss (7 min)
- Build the world's smallest reward model in your head (7 min)
- PPO is just policy gradient with a leash (7 min; see the sketch after this list)
- Run one PPO step on a toy and watch the policy shift (8 min)
- Stitch the three stages into one mental diagram (6 min)
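To preview the "leash", here is a minimal sketch, with toy numbers rather than the path's own code, of the two pieces RLHF-style PPO adds on top of plain policy gradient: a clipped probability ratio, and a KL-style penalty that pulls the policy back toward the SFT reference model.

```python
def ppo_surrogate(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """PPO clipped objective for one sample: min(ratio * A, clip(ratio) * A)."""
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)

def shaped_reward(score: float, logp_policy: float, logp_reference: float,
                  kl_coef: float = 0.1) -> float:
    """Reward-model score minus a penalty for drifting away from the SFT reference."""
    return score - kl_coef * (logp_policy - logp_reference)

# With a positive advantage, the objective stops improving once the new policy
# is more than 20% likelier than the old one to produce this response.
for ratio in (0.8, 1.0, 1.2, 1.5, 2.0):
    print(f"ratio={ratio:.1f}  surrogate={ppo_surrogate(ratio, advantage=1.0):.2f}")
```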
Phase 3: DPO, constitutional AI, and online vs offline preferences
Compare DPO, constitutional AI, and online vs offline preferences
- Your team wants RLHF, but no PPO infrastructure (7 min; see the sketch after this list)
- Your labelers can't keep up with your data needs (7 min)
- Your aligned model regresses in production (7 min)
- Your reward model has stopped helping (8 min)
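For the "no PPO infrastructure" scenario, the usual alternative is DPO. As a hedged sketch (illustrative log-probabilities, not the path's code), the loss needs only the policy, a frozen reference model, and preference pairs:

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair: no reward model, no rollouts.

    The policy's log-probabilities, measured relative to a frozen reference
    (typically the SFT model), play the role the reward model plays in PPO.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Made-up sequence log-probs: the policy has drifted toward the chosen answer.
print(dpo_loss(-12.0, -15.0, -13.0, -14.0))  # ~0.60
```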
Phase 4: Sketching a preference pipeline for your product
Sketch a preference-data pipeline for your product
- Sketch a preference-data pipeline for one real prompt (8 min; a minimal example follows below)
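To make the Phase 4 exercise concrete, here is one possible shape for a preference record and the pairwise training example it becomes. The field names and the support-desk prompt are invented for illustration, not a required schema.

```python
from dataclasses import dataclass

@dataclass
class PreferenceRecord:
    prompt: str
    response_a: str
    response_b: str
    preferred: str    # "a" or "b", as chosen by the labeler
    labeler_id: str

def to_training_pair(rec: PreferenceRecord) -> dict:
    """Turn one labeled comparison into the (prompt, chosen, rejected) triple
    that reward-model or DPO training code typically consumes."""
    chosen, rejected = (
        (rec.response_a, rec.response_b) if rec.preferred == "a"
        else (rec.response_b, rec.response_a)
    )
    return {"prompt": rec.prompt, "chosen": chosen, "rejected": rejected}

# One made-up example from a hypothetical support product.
record = PreferenceRecord(
    prompt="How do I reset my password?",
    response_a="Use the 'Forgot password' link on the sign-in page and follow the email.",
    response_b="Passwords cannot be reset.",
    preferred="a",
    labeler_id="labeler-007",
)
print(to_training_pair(record))
```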
Frequently asked questions
- What does RLHF actually do that supervised fine-tuning can't?
- Supervised fine-tuning can only imitate the demonstrations you collected, while RLHF optimizes against a learned reward, so the model can be pushed toward responses people prefer even where no demonstration exists. Phase 1 of the "Understand RLHF: Reinforcement Learning from Human Feedback" path walks through exactly where SFT hits a wall.
- How does the reward model get trained from preference labels?
- Labelers compare two responses to the same prompt, and the reward model is trained with a pairwise loss so the preferred response gets the higher score; no absolute ratings are needed. Phase 2 builds the world's smallest reward model to make this concrete.
- Why does RLHF use PPO instead of plain gradient descent?
- Naively maximizing the reward model's score lets the policy drift into responses the reward model rates well but humans would not; PPO's clipped updates and a KL penalty toward the SFT model keep each step small and anchored. Phase 2 runs one PPO step on a toy example to show the effect.
- How is DPO different from classic RLHF, and when should you use it?
- DPO trains the policy directly on preference pairs with a classification-style loss, skipping the separate reward model and the PPO loop, which makes it attractive when you have preference data but no RL infrastructure. Phase 3 compares the trade-offs.
- How many preference labels do you need to fine-tune a useful reward model?
- There is no single threshold: it depends on how broad your prompt distribution is and how consistent your labelers are. Phases 3 and 4 cover what to do when labelers can't keep up and how to scope a preference pipeline around one real prompt.
Related paths
Python Decorators Introduction
Build one mental model for Python decorators that covers closures, argument passing, functools.wraps, and stacking, then ship a working caching or logging decorator from scratch in under 30 lines.
Rust Lifetimes Explained
Stop reading `'a` as line noise and start reading it as scope arithmetic, one failing snippet at a time, until you can thread lifetimes through a small parser or iterator adapter without fighting the borrow checker.
Kubernetes Core Concepts
Stop drowning in 30+ resource types. Build the mental model one primitive at a time -- pods, deployments, services, ingress, config -- then deploy a real app with rolling updates and health checks.
Big O Intuition
Stop treating Big O as math you memorized for an interview: build the intuition to spot O(n²) disasters, pick the right data structure without thinking, and rewrite a slow function from O(n²) to O(n) in under five minutes.