
📊 Understand LLM Benchmarks: MMLU, HumanEval, and Friends

Stop reading LLM benchmark scores like IQ tests. You'll learn what MMLU, HumanEval, GSM8K, MT-Bench, and friends actually measure, where each gets gamed, and how to rate a model release note's claims with calibrated skepticism.

Applied · 14 drops · ~2-week path · 5–8 min/day · technology

Phase 1: Why benchmarks exist and what they actually claim

See why benchmarks rose and where their authority comes from

4 drops
  1. Benchmarks are a contract, not a measurement

    6 min

    Benchmarks measure what the dataset's authors decided to test, on the inputs they decided were fair, scored by the metric they decided was meaningful.

  2. Leaderboards manufacture the illusion of progress

    6 min

    Leaderboards collapse multi-dimensional capability into a single rank by averaging or aggregating, which lets a model leapfrog by improving on the cheapest sub-tasks rather than the ones you care about.

  3. Benchmarks split into knowledge, reasoning, and chat

    6 min

    Modern LLM benchmarks fall into three families: knowledge recall, structured reasoning, and open-ended generation. Each family has its own gaming patterns and failure modes that you need different tools to spot.

  4. When a benchmark hits 95%, it's stopped measuring

    6 min

    A benchmark stops measuring capability when scores cluster near the ceiling, because the remaining gap is dominated by ambiguous items, label noise, and lucky guesses rather than real skill differences; the sketch after this list puts numbers on that noise floor.
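
To put numbers on that noise floor, here is a minimal sketch using a normal approximation to the binomial; the model names and scores are invented for illustration:

```python
from math import sqrt

def score_ci(score: float, n_items: int, z: float = 1.96):
    """Normal-approximation 95% confidence interval for an accuracy
    measured on a benchmark with n_items questions."""
    se = sqrt(score * (1 - score) / n_items)  # binomial standard error
    return score - z * se, score + z * se

# Hypothetical numbers: two models on a 1,000-item benchmark.
for name, score in [("model_a", 0.952), ("model_b", 0.947)]:
    lo, hi = score_ci(score, 1_000)
    print(f"{name}: {score:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
```

At 95% accuracy on 1,000 items each interval spans roughly plus or minus 1.4 points, so the half-point gap between these two imaginary models is indistinguishable from noise.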

Phase 2: Pulling apart MMLU, HumanEval, and GSM8K

Pull apart MMLU, HumanEval, and GSM8K item by item

5 drops
  1. MMLU is 57 multiple-choice exams glued together

    7 min

    MMLU scores capture how well a model picks A/B/C/D on standardized-test questions across 57 academic subjects, which rewards memorization and four-way pattern-matching far more than reasoning or open-ended understanding.

  2. HumanEval is 164 toy problems your model could've memorized

    7 min

    HumanEval grades a model on completing 164 small, self-contained Python functions with hidden unit tests, which barely resembles the multi-file, ambiguous-spec, dependency-laden coding people actually do; its pass@k scoring rule is sketched after this list.

  3. GSM8K is grade-school word problems, and that's the point

    7 min

    GSM8K tests multi-step arithmetic word problems where success requires holding 2-8 sequential operations consistent, which exposes whether a model can chain reasoning rather than just pattern-match facts.

  4. MT-Bench and AlpacaEval grade vibes, sometimes well

    7 min

    MT-Bench and AlpacaEval use a strong LLM as the judge of open-ended responses, which gives you a chat-quality score that correlates with human preference but is biased by length, formatting, and similarity to the judge's own style.

  5. GPQA and ARC-AGI exist because the old benchmarks broke

    7 min

    GPQA, ARC-AGI, and HumanEval+ were designed specifically to resist memorization and force generalization, which is why their score curves look slow even when older benchmarks are saturating.
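
HumanEval results are conventionally reported as pass@k. The sketch below is the unbiased estimator from the original HumanEval paper (Chen et al., 2021): the probability that at least one of k samples is correct, given that c of n generated samples passed the hidden tests. The sample counts are invented:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k samples, drawn
    without replacement from n generations of which c passed the
    hidden unit tests, is correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical run: 200 samples per problem, 37 pass the tests.
print(f"pass@1  = {pass_at_k(200, 37, 1):.3f}")   # 0.185
print(f"pass@10 = {pass_at_k(200, 37, 10):.3f}")  # ~0.877
```

The same model jumps from 18.5% at k=1 to roughly 88% at k=10, which is why a pass@k claim is meaningless until you know the k.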

Phase 3: Contamination, leaderboards, and held-out evals

Spot contamination, leaderboard overfitting, and held-out evals

4 drops
  1. The benchmark is in the training set: now what?

    7 min

    When benchmark questions leak into a model's training corpus, strong scores can come from recall rather than capability; detection usually starts with n-gram or substring overlap checks between the test set and the training data, a toy version of which is sketched at the end of this phase.

  2. When the leaderboard becomes the loss function

    7 min

    Once labs tune models against a public leaderboard, Goodhart's law takes over: the score keeps climbing while the capability it was a proxy for stalls, so leaderboard gains stop transferring to your actual workload.

  3. Held-out evals: the only honest comparison left

    8 min

    Held-out evals score models on a private test set no vendor has seen, which makes them the one comparison a model cannot have trained for and the closest thing to an honest baseline a release note can offer.

  4. Dynamic benchmarks: when the test fights back

    8 min

    Dynamic benchmarks rotate or regenerate their test items over time, so memorized answers go stale quickly and contamination has a short shelf life.
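
A toy version of the overlap check from the contamination drop, loosely in the spirit of the 13-gram analyses reported for GPT-3-era models; real pipelines normalize punctuation, handle near-duplicates, and scan corpora at scale:

```python
def ngrams(text: str, n: int = 13) -> set[str]:
    """Word-level n-grams after lowercasing; real checks also strip
    punctuation and collapse whitespace."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(benchmark_item: str, training_doc: str, n: int = 13) -> bool:
    # Any shared n-gram is treated as evidence the item leaked.
    return bool(ngrams(benchmark_item, n) & ngrams(training_doc, n))

# Hypothetical usage: a GSM8K-style item versus a scraped web page.
item = "Natalia sold clips to 48 of her friends in April"
page = "forum post: Natalia sold clips to 48 of her friends in April, help?"
print(looks_contaminated(item, page, n=8))  # True
```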

Phase 4: Auditing a real model release note

Audit a real release note and rate the benchmark claims

1 drop
  1. Audit a real model release note end-to-end

    20 min

    Work through a real release note end-to-end: list the benchmarks it cites, check each score against the saturation, contamination, and judge-bias issues from the earlier phases, and rate every claim from solid to marketing. The significance sketch below is often all it takes to deflate a headline delta.
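
One concrete audit tool: check whether a headline delta clears the benchmark's sampling noise. The crude two-proportion z-test below ignores that both models answer the same questions (a paired test would be sharper), so it is if anything generous to the claim. The scores are invented; 14,042 is MMLU's actual test-set size:

```python
from math import erf, sqrt

def delta_z(score_a: float, score_b: float, n_items: int) -> float:
    """z-score for the gap between two accuracies measured on the
    same n_items-question benchmark, treating the runs as independent."""
    p = (score_a + score_b) / 2            # pooled accuracy
    se = sqrt(2 * p * (1 - p) / n_items)   # standard error of the gap
    return (score_a - score_b) / se

# Hypothetical release-note claim: "86.4 MMLU, up from 85.9".
z = delta_z(0.864, 0.859, 14_042)
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided
print(f"z = {z:.2f}, p = {p_value:.2f}")  # z = 1.21, p = 0.23
```

A gap that even this generous test cannot distinguish from noise belongs in the "marketing" column of your audit.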

Frequently asked questions

What does MMLU actually test and why does every model score on it?
MMLU stitches together multiple-choice exams from 57 academic subjects and scores how often a model picks the right letter. It became the default headline number because it is cheap to run and easy to compare across models, but it rewards memorization and four-way pattern-matching far more than reasoning.
How is HumanEval different from real-world coding ability?
HumanEval asks a model to complete 164 small, self-contained Python functions checked by hidden unit tests. Real-world coding is multi-file, ambiguous-spec, and dependency-laden, so a strong HumanEval score says little about day-to-day engineering work, and the problems are small and public enough to have been memorized.
What is benchmark contamination and how do you detect it?
Contamination means benchmark items leaked into the training data, letting a model answer from memory rather than ability. Detection usually starts with n-gram or substring overlap checks between the test set and the training corpus, backed up by suspicion when a model aces old public benchmarks but stumbles on fresh ones.
Why do new models beat benchmarks without seeming smarter in practice?
Usually some mix of saturation, contamination, and leaderboard overfitting: once scores cluster near the ceiling and labs tune against public test sets, the numbers keep rising even though the remaining gap is mostly noise rather than new capability.
Which LLM benchmarks should you trust in a release note?
Prefer held-out and dynamic evals that vendors cannot have trained against, discount saturated benchmarks and LLM-judged scores like MT-Bench and AlpacaEval, and check whether a claimed gain is bigger than the benchmark's sampling noise before you believe it.