Understand LLM Benchmarks: MMLU, HumanEval, and Friends
Stop reading LLM benchmark scores like IQ tests. You'll learn what MMLU, HumanEval, GSM8K, MT-Bench, and friends actually measure, where each gets gamed, and how to rate a model release note's claims with calibrated skepticism.
Phase 1: Why benchmarks exist and what they actually claim
See why benchmarks rose and where their authority comes from
Benchmarks are a contract, not a measurement
6 min · Benchmarks measure what the dataset's authors decided to test, on the inputs they decided were fair, scored by the metric they decided was meaningful.
Leaderboards manufacture the illusion of progress
6 min · Leaderboards collapse multi-dimensional capability into a single rank by averaging or aggregating, which lets a model leapfrog by improving on the cheapest sub-tasks rather than the ones you care about.
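A toy sketch of that rank flip, with entirely made-up sub-task scores:

```python
# Toy illustration (made-up numbers): a single averaged rank can crown the
# model that improved the cheap sub-tasks, not the one you actually run.
model_a = {"knowledge": 0.82, "coding": 0.70, "your_task": 0.75}
model_b = {"knowledge": 0.93, "coding": 0.78, "your_task": 0.61}

def leaderboard_avg(scores: dict[str, float]) -> float:
    return sum(scores.values()) / len(scores)

print(f"A: avg={leaderboard_avg(model_a):.3f}, your task={model_a['your_task']:.2f}")
print(f"B: avg={leaderboard_avg(model_b):.3f}, your task={model_b['your_task']:.2f}")
# B "wins" the leaderboard (0.773 vs 0.757) while losing 14 points on your task.
```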
Benchmarks split into knowledge, reasoning, and chat
6 min · Modern LLM benchmarks fall into three families (knowledge recall, structured reasoning, and open-ended generation), and each family has its own gaming patterns and failure modes you need different tools to spot.
When a benchmark hits 95%, it's stopped measuring
6 min · A benchmark stops measuring capability when scores cluster near the ceiling, because the remaining gap is dominated by ambiguous items, label noise, and lucky guesses rather than real skill differences.
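A back-of-the-envelope check you can run on any reported score; this is a minimal sketch assuming simple binomial sampling noise, nothing more:

```python
import math

def ci95_halfwidth(acc: float, n: int) -> float:
    """95% confidence half-width for an accuracy measured on n items
    (normal approximation to the binomial)."""
    return 1.96 * math.sqrt(acc * (1.0 - acc) / n)

n = 1319  # size of a typical test split (GSM8K's test set, for instance)
for acc in (0.94, 0.95):
    print(f"{acc:.0%} ± {ci95_halfwidth(acc, n):.1%}")
# 94% ± 1.3% and 95% ± 1.2% overlap: a one-point "win" near the ceiling sits
# inside sampling noise, before you even count ambiguous items and label errors.
```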
Phase 2: Pulling apart MMLU, HumanEval, and GSM8K
Pull apart MMLU, HumanEval, and GSM8K item by item
MMLU is 57 multiple-choice exams glued together
7 min · MMLU scores capture how well a model picks A/B/C/D on standardized-test questions across 57 academic subjects, which rewards memorization and four-way pattern-matching far more than reasoning or open-ended understanding.
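One common scoring protocol, sketched below: harnesses such as EleutherAI's lm-evaluation-harness typically rank the model's log-probability of each answer letter rather than parsing free-form text. The `logprob` callable here is a hypothetical stand-in for a real model API.

```python
from typing import Callable

def score_mmlu_item(
    logprob: Callable[[str, str], float],  # (prompt, continuation) -> log P(continuation | prompt)
    question: str,
    options: dict[str, str],
    gold: str,
) -> bool:
    """Score one item the way common harnesses do: format the question with its
    four options, then pick the letter the model assigns the highest log-prob."""
    prompt = question + "\n" + "\n".join(f"{k}. {v}" for k, v in options.items()) + "\nAnswer:"
    prediction = max(options, key=lambda letter: logprob(prompt, f" {letter}"))
    return prediction == gold

# Toy stand-in for a real model API, just to exercise the function.
fake_logprob = lambda prompt, cont: {" A": -2.0, " B": -0.5, " C": -3.0, " D": -2.5}[cont]
print(score_mmlu_item(fake_logprob, "2 + 2 = ?", {"A": "3", "B": "4", "C": "5", "D": "22"}, "B"))  # True
```

Note what this protocol rewards: the model never has to produce the answer, only to prefer one of four given strings.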
HumanEval is 164 toy problems your model could've memorized
7 min · HumanEval grades a model on completing 164 small, self-contained Python functions with hidden unit tests, which barely resembles the multi-file, ambiguous-spec, dependency-laden coding people actually do.
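HumanEval's headline metric is pass@k, and the original paper (Chen et al., 2021) gives an unbiased estimator for it, which is why a reported pass@1 depends on how many completions were sampled per problem and at what temperature. A minimal implementation:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    1 - C(n-c, k) / C(n, k), computed stably as a running product.
    n = samples generated per problem, c = samples passing the hidden tests."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

print(pass_at_k(200, 10, 1))    # 0.05: the per-sample success rate
print(pass_at_k(200, 10, 100))  # ~0.999: with enough tries, a weak model "solves" it
```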
GSM8K is grade-school word problems, and that's the point
7 min · GSM8K tests multi-step arithmetic word problems where success requires holding 2-8 sequential operations consistent, which exposes whether a model can chain reasoning rather than just pattern-match facts.
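Scoring is typically exact match on the final number: GSM8K reference solutions end with "#### &lt;answer&gt;", and graders pull the last number out of the model's chain of thought. A minimal sketch of that convention:

```python
import re

def final_number(text: str) -> str | None:
    """Extract the final numeric answer from a GSM8K-style solution.
    Reference answers end with '#### <number>'; for free-form model output,
    a common fallback is simply the last number in the text."""
    m = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", text)
    if m:
        return m.group(1).replace(",", "")
    nums = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return nums[-1].replace(",", "") if nums else None

gold = "She sold 48 clips in April and half as many in May. 48 + 24 = 72\n#### 72"
pred = "In April she sold 48, in May half as many (24), so the answer is 72."
print(final_number(pred) == final_number(gold))  # True: exact match on the final number
```

The brittleness is the lesson: a model that reasons correctly but formats the answer oddly scores zero, and one that guesses the right number after broken reasoning scores full marks.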
MT-Bench and AlpacaEval grade vibes β sometimes well
7 min · MT-Bench and AlpacaEval use a strong LLM as the judge of open-ended responses, which gives you a chat-quality score that correlates with human preference but is biased by length, formatting, and similarity to the judge's own style.
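A quick probe for the length bias, sketched as a rank correlation between judge scores and response word counts (the data and the `length_bias` helper are invented here for illustration). AlpacaEval 2.0's length-controlled win rates exist precisely because this correlation tends to be strong.

```python
from scipy.stats import spearmanr

def length_bias(responses: list[str], judge_scores: list[float]) -> float:
    """Rank-correlate judge scores with response length. A strong positive
    correlation means part of your 'quality' signal is just verbosity."""
    lengths = [len(r.split()) for r in responses]
    rho, _ = spearmanr(lengths, judge_scores)
    return float(rho)

# Made-up data where the judge happens to reward length perfectly.
responses = ["Yes.", "Yes, because X.", "Yes, because X and Y, with caveats.", "Yes. " * 40]
print(length_bias(responses, [5.0, 6.5, 8.0, 9.5]))  # 1.0 on this toy data
```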
GPQA and ARC-AGI exist because the old benchmarks broke
7 min · GPQA, ARC-AGI, and HumanEval+ were designed specifically to resist memorization and force generalization, which is why their score curves look slow even when older benchmarks are saturating.
Phase 3: Contamination, leaderboards, and held-out evals
Spot contamination, leaderboard overfitting, and held-out evals
The benchmark is in the training set. Now what?
7 min · When benchmark items leak into pretraining data, the score measures recall of the test itself; learn the telltale signs, from suspiciously verbatim completions of test questions to gaps between public items and freshly written ones, and what a contaminated score can still tell you.
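The standard overlap check, sketched minimally below; labs have reported contamination analyses in roughly this shape (the GPT-3 paper used 13-gram overlap). Everything here assumes access to the training corpus, which is exactly why outsiders struggle to verify contamination claims.

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def flag_overlap(benchmark_item: str, training_docs: list[str], n: int = 13) -> bool:
    """Flag a benchmark item whose n-grams also appear in training documents.
    Real pipelines hash the corpus side once (Bloom filters, suffix arrays)
    rather than rescanning it per item as this sketch does."""
    item_grams = ngrams(benchmark_item, n)
    return any(item_grams & ngrams(doc, n) for doc in training_docs)

item = "A train leaves station A at 9am traveling 60 mph toward station B, 300 miles away."
crawl = ["blog post: a train leaves station a at 9am traveling 60 mph toward station b, 300 miles away."]
print(flag_overlap(item, crawl))  # True: a 13-gram from the item appears verbatim in the crawl
```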
When the leaderboard becomes the loss function
7 min · Once labs tune toward the leaderboard, Goodhart's law takes over: scores keep climbing while the gains stop transferring, because the benchmark has become the optimization target instead of a measurement.
Held-out evals: the only honest comparison left
8 min · Held-out evals keep a private test set that model vendors have never seen, so nobody could have trained or tuned against it; a private eval you run yourself is the one comparison a release note can't game.
Dynamic benchmarks: when the test fights back
8 min · Dynamic benchmarks refresh or regenerate their items faster than training corpora can absorb them, trading comparability across time for resistance to contamination and overfitting.
Phase 4: Auditing a real model release note
Audit a real release note and rate the benchmark claims
Audit a real model release note end-to-end
20 min · Work through a real release note's benchmark table claim by claim: flag saturated scores, contamination risk, judge bias, and missing error bars, then write a calibrated verdict on what the model actually improved.
Frequently asked questions
- What does MMLU actually test and why does every model score on it?
- MMLU is 57 multiple-choice exams glued together: the score captures how well a model picks A/B/C/D across academic subjects, which rewards memorization and four-way pattern-matching more than open-ended reasoning. It appears in every release note because it's cheap to run and became the default knowledge yardstick; Phase 2 of this path pulls it apart item by item.
- How is HumanEval different from real-world coding ability?
- HumanEval grades completion of 164 small, self-contained Python functions against hidden unit tests, which barely resembles multi-file, ambiguous-spec, dependency-laden production coding. A high pass rate means the model writes correct toy functions, not that it can navigate a real codebase; Phase 2 covers the gap in detail.
- What is benchmark contamination and how do you detect it?
- Contamination means benchmark items leaked into the training set, so the score partly measures recall of the test itself. Detection signals include verbatim completions of test questions, n-gram overlap with training data, and score gaps between public items and freshly written ones; Phase 3 covers them, along with held-out and dynamic evals.
- Why do new models beat benchmarks without seeming smarter in practice?
- Usually some mix of contamination and leaderboard overfitting: when the leaderboard becomes the loss function, models improve on the measured sub-tasks without gaining the capabilities you actually use, so scores climb while practical quality stalls. Phase 3 covers both failure modes.
- Which LLM benchmarks should you trust in a release note?
- None uncritically. Discount saturated or contaminated benchmarks, weight newer memorization-resistant ones like GPQA, ARC-AGI, and HumanEval+ more heavily, and prefer held-out or dynamic evals where they're reported; Phase 4 walks through auditing a full release note this way.
Related paths
🐍 Python Decorators Introduction
Build one mental model for Python decorators that covers closures, argument passing, functools.wraps, and stacking, then ship a working caching or logging decorator from scratch in under 30 lines.
🦀 Rust Lifetimes Explained
Stop reading `'a` as line noise and start reading it as scope arithmetic, one failing snippet at a time, until you can thread lifetimes through a small parser or iterator adapter without fighting the borrow checker.
☸️ Kubernetes Core Concepts
Stop drowning in 30+ resource types. Build the mental model one primitive at a time -- pods, deployments, services, ingress, config -- then deploy a real app with rolling updates and health checks.
Big O Intuition
Stop treating Big O as math you memorized for an interview: build the intuition to spot O(n²) disasters, pick the right data structure without thinking, and rewrite a slow function from O(n²) to O(n) in under five minutes.