
🤖 Understand RLHF: Reinforcement Learning from Human Feedback

Walk a single example through SFT, a reward model, and one PPO update so the RLHF loop stops feeling mythical. By the end, you'll sketch a preference-data pipeline for a real prompt in your own product.

Applied · 14 drops · ~2-week path · 5–8 min/day · technology

Phase 1: Why pretrained models are smart but uncooperative

See why pretrained models are smart but uncooperative

4 drops
  1. Pretrained models are smart but uncooperative (6 min)
  2. Supervised fine-tuning hits a wall fast (6 min)
  3. RLHF is a three-stage relay, not one trick (7 min)
  4. Comparisons beat ratings, and the math says why (6 min; see the sketch after this list)
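The claim in drop 4 has a compact mathematical core: under the Bradley-Terry model, the probability that one response beats another depends only on the difference between their scores, so pairwise labels can train a reward model without labelers ever agreeing on an absolute rating scale. Here is a minimal sketch of that pairwise loss with made-up scores; it illustrates the idea and is not the course's own code.

```python
# A minimal sketch (not the course's code) of the Bradley-Terry pairwise loss
# that turns "A is better than B" labels into a trainable reward-model signal.
import math

def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    P(chosen beats rejected) = sigmoid(score_chosen - score_rejected), so only
    the *difference* in scores matters; labelers never need a shared rating scale.
    """
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical reward-model scores for two responses to the same prompt:
print(bradley_terry_loss(1.2, 0.3))   # small loss: the model already agrees with the label
print(bradley_terry_loss(0.3, 1.2))   # larger loss: the model ranks the pair backwards
```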

Phase 2: Tracing one example through the RLHF loop

Trace one example through SFT, reward model, and PPO

5 drops
  1. Walk one prompt through SFT and watch the loss (7 min)
  2. Build the world's smallest reward model in your head (7 min)
  3. PPO is just policy gradient with a leash (7 min; sketched in code after this list)
  4. Run one PPO step on a toy and watch the policy shift (8 min)
  5. Stitch the three stages into one mental diagram (6 min)
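To make drops 3 and 4 concrete before you get there, here is a minimal sketch of a policy-gradient update with the KL "leash" toward a frozen reference policy, on a toy policy that is just a softmax over four canned responses to one prompt. The response names, rewards, and hyperparameters are invented for illustration, and the sketch uses the KL-shaped reward rather than PPO's full clipped objective to stay short.

```python
# A minimal sketch (assumed numbers, not the course's exercise): policy gradient
# with a KL "leash" on a toy softmax policy over four canned responses.
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

responses = ["helpful", "verbose", "evasive", "rude"]
rewards = np.array([1.0, 0.3, -0.5, -1.0])   # pretend reward-model scores

theta = np.zeros(4)           # logits of the policy being tuned
ref_theta = theta.copy()      # frozen SFT reference policy
beta, lr = 0.5, 0.5           # KL penalty weight and learning rate

for _ in range(50):
    p, p_ref = softmax(theta), softmax(ref_theta)

    # RLHF-style shaped reward: reward-model score minus a per-response KL penalty.
    shaped = rewards - beta * np.log(p / p_ref)

    # Exact policy gradient of E_p[shaped] for a softmax policy, treating the
    # shaped reward as fixed; subtracting the mean acts as a baseline (advantage).
    advantage = shaped - (p * shaped).sum()
    theta += lr * p * advantage

print({r: round(float(q), 3) for r, q in zip(responses, softmax(theta))})
# Probability mass shifts toward "helpful"; the KL term is what stops the policy
# from collapsing onto it and forgetting the reference distribution entirely.
```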

Phase 3: DPO, constitutional AI, and online vs offline preferences

Compare DPO, constitutional AI, and online vs offline

4 drops
  1. Your team wants RLHF, but no PPO infrastructure (7 min; see the DPO sketch after this list)
  2. Your labelers can't keep up with your data needs (7 min)
  3. Your aligned model regresses in production (7 min)
  4. Your reward model has stopped helping (8 min)
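For the first scenario above, the usual escape hatch is a DPO-style objective: it trains the policy directly on preference pairs, with the frozen reference model playing the leash role that the KL penalty plays in PPO-based RLHF, and no reward model or rollout loop in sight. A minimal sketch with assumed inputs (the summed token log-probabilities of each full response under the policy and under the reference):

```python
# A minimal sketch (assumed inputs, not the course's code) of the DPO loss for
# a single preference pair, given summed token log-probs of each full response.
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one (chosen, rejected) pair.

    The policy is pushed to raise the chosen response's log-prob relative to the
    frozen reference by more than it raises the rejected one's.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Illustrative numbers: the policy already prefers the chosen response a bit more
# than the reference does, so the loss comes out below log(2) ~= 0.693.
print(dpo_loss(-42.0, -51.0, -45.0, -50.0))
```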

Phase 4: Sketching a preference pipeline for your product

Sketch a preference-data pipeline for your product

1 drop
  1. Sketch a preference-data pipeline for one real prompt (8 min; a starter record format is sketched below)
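If you want a head start on that exercise, a good first move is deciding what a single preference record looks like before worrying about volume or tooling. The schema below is a made-up illustration with assumed field names, not a prescribed format; the point is that one JSON line per comparison is enough to feed either a reward model or DPO later.

```python
# A made-up illustration of one preference record for a single product prompt;
# field names are assumptions, not a standard schema.
from dataclasses import dataclass, asdict
import json

@dataclass
class PreferenceRecord:
    prompt: str              # the real user prompt you chose to start with
    response_a: str          # candidate sampled from your current model
    response_b: str          # second candidate (different temperature, or an older model)
    chosen: str              # "a" or "b", as judged by a labeler or a written rubric
    labeler_id: str          # lets you measure inter-labeler agreement later
    guideline_version: str   # which version of the rubric the judgment followed

record = PreferenceRecord(
    prompt="Summarize this support ticket for the on-call engineer.",
    response_a="User reports a login loop after the 2.3 release...",
    response_b="Someone can't log in. Probably their fault.",
    chosen="a",
    labeler_id="annotator-07",
    guideline_version="v0.2",
)

# One JSON line per comparison is enough to train either a reward model or DPO.
print(json.dumps(asdict(record)))
```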

Frequently asked questions

What does RLHF actually do that supervised fine-tuning can't?
Supervised fine-tuning can only imitate the demonstrations you collect, so it plateaus at the quality of those demonstrations. RLHF optimizes the model against human judgments of which output is better, which improves qualities that are easy to compare but hard to demonstrate, such as tone, honesty, and graceful refusals. Phase 1 of this path walks through exactly where SFT hits a wall.
How does the reward model get trained from preference labels?
Labelers compare two responses to the same prompt, and the reward model is trained with a pairwise (Bradley-Terry style) loss that pushes the chosen response's score above the rejected one's. The sketch after Phase 1 shows the loss; the "world's smallest reward model" drop in Phase 2 builds the intuition by hand.
Why does RLHF use PPO instead of plain gradient descent?
The reward model scores whole sampled responses, so its signal does not backpropagate through text generation the way a supervised loss does; you need a policy-gradient method. PPO adds clipped updates and a KL penalty toward the SFT model so each step stays small and the policy doesn't drift away from fluent language. Phase 2's "policy gradient with a leash" drop unpacks this.
How is DPO different from classic RLHF, and when should you use it?
DPO skips the separate reward model and the RL loop entirely: it fine-tunes the policy directly on preference pairs with a loss derived from the same objective, which makes it much simpler to run when your preference data is fixed and offline. Classic PPO-based RLHF still earns its keep when you keep sampling and scoring fresh outputs online. Phase 3 compares the trade-offs.
How many preference labels do you need to fine-tune a useful reward model?
It depends on the base model and how narrow the behavior is. Published RLHF work such as InstructGPT used on the order of tens of thousands of comparisons for its reward model, while narrower domains can often start with far fewer; the final phase of this path helps you reason about sizing for your own prompt.