🎯 Understand Reward Hacking and Goodhart's Law in RLHF
Spot reward hacking in real model outputs, including length bias, sycophancy, refusal escalation, and sophistication bias, and pick the right mitigation (KL penalty, reward model ensembling, or process-based reward) for each failure mode.
Phase 1: Why RLHF Games Its Own Reward
See why every RLHF pipeline games its own reward
The measure became the target, and stopped being a measure (6 min)
Your reward model is a 7B opinion, and your policy is paid to flatter it (7 min)
Reward model accuracy on the test set lies once PPO starts running (7 min)
Length, sycophancy, refusals, sophistication: the four ways your RM lies (7 min)
Phase 2: Spotting Reward Hacks in the Wild
Name the four failure modes from real outputs
Verbosity is the easiest hack, and the one your team will normalize (6 min)
"You're absolutely right!": the three words your model learned to print (7 min)
When safety training overshoots, your model refuses to help you sharpen a knife (7 min)
Confidence is rewarded; calibration isn't. That's why your model sounds smart and is sometimes wrong (8 min)
One output, all four lenses: your first triage pass (8 min)
Phase 3: Mitigations and Their Trade-offs
Decide which mitigation fixes which failure mode
Gao's law: every nat of KL costs you preference accuracy, and you can predict the cliff (8 min)
Tune β like a leash: too short, no learning; too long, full Goodhart (7 min)
If one RM hallucinates a reward signal, three RMs trip over each other's hallucinations (8 min)
Reward the steps, not the answer, and reward hacking has nowhere to hide (8 min)
Phase 4: Sketch a Mitigation Memo
Sketch a mitigation memo for one real failure
Write the one-page memo: failure mode, root cause, mitigation, predicted result (18 min)
Frequently asked questions
- What is reward hacking in RLHF and why does it happen?
- How does Goodhart's Law apply to language model fine-tuning?
- Why do RLHF'd models become longer and more sycophantic?
- What is reward model overoptimization and how is it measured?
- Does a KL penalty actually prevent reward hacking?

Each of these questions is covered in the “Understand Reward Hacking and Goodhart's Law in RLHF” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.