Understand LLM Benchmarks: MMLU, HumanEval, and Friends
Stop reading LLM benchmark scores like IQ tests. You'll learn what MMLU, HumanEval, GSM8K, MT-Bench, and friends actually measure, where each gets gamed, and how to rate a model release note's claims with calibrated skepticism.
Phase 1: Why benchmarks exist and what they actually claim
See why benchmarks rose and where their authority comes from
Benchmarks are a contract, not a measurement
6 min · Benchmarks measure what the dataset's authors decided to test, on the inputs they decided were fair, scored by the metric they decided was meaningful.
Leaderboards manufacture the illusion of progress
6 min · Leaderboards collapse multi-dimensional capability into a single rank by averaging or aggregating, which lets a model leapfrog by improving on the cheapest sub-tasks rather than the ones you care about.
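A toy sketch of that rank flip, with entirely made-up sub-task scores:

```python
# Toy illustration (made-up numbers): a single averaged rank can crown the
# model that improved the cheap sub-tasks, not the one you actually run.
model_a = {"knowledge": 0.82, "coding": 0.70, "your_task": 0.75}
model_b = {"knowledge": 0.93, "coding": 0.78, "your_task": 0.61}

def leaderboard_avg(scores: dict[str, float]) -> float:
    return sum(scores.values()) / len(scores)

print(f"A: avg={leaderboard_avg(model_a):.3f}, your task={model_a['your_task']:.2f}")
print(f"B: avg={leaderboard_avg(model_b):.3f}, your task={model_b['your_task']:.2f}")
# B "wins" the leaderboard (0.773 vs 0.757) while losing 14 points on your task.
```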
Benchmarks split into knowledge, reasoning, and chat
6 min · Modern LLM benchmarks fall into three families (knowledge recall, structured reasoning, and open-ended generation), and each family has its own gaming patterns and failure modes you need different tools to spot.
When a benchmark hits 95%, it's stopped measuring
6 min · A benchmark stops measuring capability when scores cluster near the ceiling, because the remaining gap is dominated by ambiguous items, label noise, and lucky guesses rather than real skill differences.
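A back-of-the-envelope check you can run on any reported score; this is a minimal sketch assuming simple binomial sampling noise, nothing more:

```python
import math

def ci95_halfwidth(acc: float, n: int) -> float:
    """95% confidence half-width for an accuracy measured on n items
    (normal approximation to the binomial)."""
    return 1.96 * math.sqrt(acc * (1.0 - acc) / n)

n = 1319  # size of a typical test split (GSM8K's test set, for instance)
for acc in (0.94, 0.95):
    print(f"{acc:.0%} ± {ci95_halfwidth(acc, n):.1%}")
# 94% ± 1.3% and 95% ± 1.2% overlap: a one-point "win" near the ceiling sits
# inside sampling noise, before you even count ambiguous items and label errors.
```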
Phase 2: Pulling apart MMLU, HumanEval, and GSM8K
Pull apart MMLU, HumanEval, and GSM8K item by item
MMLU is 57 multiple-choice exams glued together
7 min · MMLU scores capture how well a model picks A/B/C/D on standardized-test questions across 57 academic subjects, which rewards memorization and four-way pattern-matching far more than reasoning or open-ended understanding.
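One common scoring protocol, sketched below: harnesses such as EleutherAI's lm-evaluation-harness typically rank the model's log-probability of each answer letter rather than parsing free-form text. The `logprob` callable here is a hypothetical stand-in for a real model API.

```python
from typing import Callable

def score_mmlu_item(
    logprob: Callable[[str, str], float],  # (prompt, continuation) -> log P(continuation | prompt)
    question: str,
    options: dict[str, str],
    gold: str,
) -> bool:
    """Score one item the way common harnesses do: format the question with its
    four options, then pick the letter the model assigns the highest log-prob."""
    prompt = question + "\n" + "\n".join(f"{k}. {v}" for k, v in options.items()) + "\nAnswer:"
    prediction = max(options, key=lambda letter: logprob(prompt, f" {letter}"))
    return prediction == gold

# Toy stand-in for a real model API, just to exercise the function.
fake_logprob = lambda prompt, cont: {" A": -2.0, " B": -0.5, " C": -3.0, " D": -2.5}[cont]
print(score_mmlu_item(fake_logprob, "2 + 2 = ?", {"A": "3", "B": "4", "C": "5", "D": "22"}, "B"))  # True
```

Note what this protocol rewards: the model never has to produce the answer, only to prefer one of four given strings.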
HumanEval is 164 toy problems your model could've memorized
7 min · HumanEval grades a model on completing 164 small, self-contained Python functions with hidden unit tests, which barely resembles the multi-file, ambiguous-spec, dependency-laden coding people actually do.
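HumanEval's headline metric is pass@k, and the original paper (Chen et al., 2021) gives an unbiased estimator for it, which is why a reported pass@1 depends on how many completions were sampled per problem and at what temperature. A minimal implementation:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    1 - C(n-c, k) / C(n, k), computed stably as a running product.
    n = samples generated per problem, c = samples passing the hidden tests."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

print(pass_at_k(200, 10, 1))    # 0.05: the per-sample success rate
print(pass_at_k(200, 10, 100))  # ~0.999: with enough tries, a weak model "solves" it
```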
GSM8K is grade-school word problems, and that's the point
7 min · GSM8K tests multi-step arithmetic word problems where success requires holding 2-8 sequential operations consistent, which exposes whether a model can chain reasoning rather than just pattern-match facts.
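Scoring is typically exact match on the final number: GSM8K reference solutions end with "#### &lt;answer&gt;", and graders pull the last number out of the model's chain of thought. A minimal sketch of that convention:

```python
import re

def final_number(text: str) -> str | None:
    """Extract the final numeric answer from a GSM8K-style solution.
    Reference answers end with '#### <number>'; for free-form model output,
    a common fallback is simply the last number in the text."""
    m = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", text)
    if m:
        return m.group(1).replace(",", "")
    nums = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return nums[-1].replace(",", "") if nums else None

gold = "She sold 48 clips in April and half as many in May. 48 + 24 = 72\n#### 72"
pred = "In April she sold 48, in May half as many (24), so the answer is 72."
print(final_number(pred) == final_number(gold))  # True: exact match on the final number
```

The brittleness is the lesson: a model that reasons correctly but formats the answer oddly scores zero, and one that guesses the right number after broken reasoning scores full marks.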
MT-Bench and AlpacaEval grade vibes β sometimes well
7 min · MT-Bench and AlpacaEval use a strong LLM as the judge of open-ended responses, which gives you a chat-quality score that correlates with human preference but is biased by length, formatting, and similarity to the judge's own style.
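A quick probe for the length bias, sketched as a rank correlation between judge scores and response word counts (the data and the `length_bias` helper are invented here for illustration). AlpacaEval 2.0's length-controlled win rates exist precisely because this correlation tends to be strong.

```python
from scipy.stats import spearmanr

def length_bias(responses: list[str], judge_scores: list[float]) -> float:
    """Rank-correlate judge scores with response length. A strong positive
    correlation means part of your 'quality' signal is just verbosity."""
    lengths = [len(r.split()) for r in responses]
    rho, _ = spearmanr(lengths, judge_scores)
    return float(rho)

# Made-up data where the judge happens to reward length perfectly.
responses = ["Yes.", "Yes, because X.", "Yes, because X and Y, with caveats.", "Yes. " * 40]
print(length_bias(responses, [5.0, 6.5, 8.0, 9.5]))  # 1.0 on this toy data
```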
GPQA and ARC-AGI exist because the old benchmarks broke
7 min · GPQA, ARC-AGI, and HumanEval+ were designed specifically to resist memorization and force generalization, which is why their score curves look slow even when older benchmarks are saturating.
Phase 3: Contamination, leaderboards, and held-out evals
Spot contamination, leaderboard overfitting, and held-out evals
The benchmark is in the training set. Now what?
7 min · When benchmark items leak into pretraining data, the score measures recall of the test itself; learn the telltale signs, from suspiciously verbatim completions of test questions to gaps between public items and freshly written ones, and what a contaminated score can still tell you.
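The standard overlap check, sketched minimally below; labs have reported contamination analyses in roughly this shape (the GPT-3 paper used 13-gram overlap). Everything here assumes access to the training corpus, which is exactly why outsiders struggle to verify contamination claims.

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def flag_overlap(benchmark_item: str, training_docs: list[str], n: int = 13) -> bool:
    """Flag a benchmark item whose n-grams also appear in training documents.
    Real pipelines hash the corpus side once (Bloom filters, suffix arrays)
    rather than rescanning it per item as this sketch does."""
    item_grams = ngrams(benchmark_item, n)
    return any(item_grams & ngrams(doc, n) for doc in training_docs)

item = "A train leaves station A at 9am traveling 60 mph toward station B, 300 miles away."
crawl = ["blog post: a train leaves station a at 9am traveling 60 mph toward station b, 300 miles away."]
print(flag_overlap(item, crawl))  # True: a 13-gram from the item appears verbatim in the crawl
```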
When the leaderboard becomes the loss function
7 min · Once labs tune toward the leaderboard, Goodhart's law takes over: scores keep climbing while the gains stop transferring, because the benchmark has become the optimization target instead of a measurement.
Held-out evals: the only honest comparison left
8 min · Held-out evals keep a private test set that model vendors have never seen, so nobody could have trained or tuned against it; a private eval you run yourself is the one comparison a release note can't game.
Dynamic benchmarks: when the test fights back
8 min · Dynamic benchmarks refresh or regenerate their items faster than training corpora can absorb them, trading comparability across time for resistance to contamination and overfitting.
Phase 4: Auditing a real model release note
Audit a real release note and rate the benchmark claims
Audit a real model release note end-to-end
20 min · Work through a real release note's benchmark table claim by claim: flag saturated scores, contamination risk, judge bias, and missing error bars, then write a calibrated verdict on what the model actually improved.
Frequently asked questions
- What does MMLU actually test and why does every model score on it?
- MMLU is 57 multiple-choice exams glued together: the score captures how well a model picks A/B/C/D across academic subjects, which rewards memorization and four-way pattern-matching more than open-ended reasoning. It appears in every release note because it's cheap to run and became the default knowledge yardstick; Phase 2 of this path pulls it apart item by item.
- How is HumanEval different from real-world coding ability?
- HumanEval grades completion of 164 small, self-contained Python functions against hidden unit tests, which barely resembles multi-file, ambiguous-spec, dependency-laden production coding. A high pass rate means the model writes correct toy functions, not that it can navigate a real codebase; Phase 2 covers the gap in detail.
- What is benchmark contamination and how do you detect it?
- Contamination means benchmark items leaked into the training set, so the score partly measures recall of the test itself. Detection signals include verbatim completions of test questions, n-gram overlap with training data, and score gaps between public items and freshly written ones; Phase 3 covers them, along with held-out and dynamic evals.
- Why do new models beat benchmarks without seeming smarter in practice?
- Usually some mix of contamination and leaderboard overfitting: when the leaderboard becomes the loss function, models improve on the measured sub-tasks without gaining the capabilities you actually use, so scores climb while practical quality stalls. Phase 3 covers both failure modes.
- Which LLM benchmarks should you trust in a release note?
- None uncritically. Discount saturated or contaminated benchmarks, weight newer memorization-resistant ones like GPQA, ARC-AGI, and HumanEval+ more heavily, and prefer held-out or dynamic evals where they're reported; Phase 4 walks through auditing a full release note this way.
Related paths
🐍 Python Decorators Introduction
Build one mental model for Python decorators that covers closures, argument passing, functools.wraps, and stacking, then ship a working caching or logging decorator from scratch in under 30 lines.
🦀 Rust Lifetimes Explained
Stop reading `'a` as line noise and start reading it as scope arithmetic, one failing snippet at a time, until you can thread lifetimes through a small parser or iterator adapter without fighting the borrow checker.
☸️ Kubernetes Core Concepts
Stop drowning in 30+ resource types. Build the mental model one primitive at a time -- pods, deployments, services, ingress, config -- then deploy a real app with rolling updates and health checks.
Big O Intuition
Stop treating Big O as math you memorized for an interview: build the intuition to spot O(n²) disasters, pick the right data structure without thinking, and rewrite a slow function from O(n²) to O(n) in under five minutes.