🧪 Understand Benchmark Saturation and Contamination

MMLU plateaued. HumanEval is in the training set. You'll separate saturation from contamination, run n-gram and perplexity checks on real test items, and design a holdout that's structurally hard to leak — defensible enough to put in front of a buyer.

Applied · 14 drops · ~2-week path · 5–8 min/day · technology

Phase 1: What saturation and contamination actually mean

4 drops
  1. Saturation and contamination are not the same problem

    6 min

  2. Saturation is a shape, not a score

    7 min

  3. Contamination is a continuum, not a binary

    7 min

  4. The last 5 points of progress are probably not real

    7 min

Phase 2: Detect contamination with n-grams, perplexity, canaries

5 drops
  1. N-gram overlap is the cheapest contamination probe

    7 min

  2. Perplexity gaps reveal items the model has seen before

    8 min

  3. Canary strings are a contamination smoke alarm you install once

    6 min

  4. No single probe is enough — triangulate three

    7 min

  5. Run the audit before you cite the score

    7 min

Phase 3: Goodhart, Arena pressure, and dynamic evals

4 drops
  1. A measure becomes a target — and stops measuring

    8 min

  2. The leaderboard shapes the training mix

    7 min

  3. Moving targets defeat the optimization loop

    8 min

  4. No single benchmark survives — portfolio them

    7 min

Phase 4: Audit your eval set and design a leak-resistant holdout

1 drop
  1. Audit and rebuild a real eval — end to end

    25 min

Frequently asked questions

What's the difference between benchmark saturation and benchmark contamination?
Why is MMLU no longer a useful signal of model progress?
How do I detect if my eval set leaked into a model's pretraining data?
What is a canary string and how do I use one in evals?
Why do LiveBench and dynamic evals exist if static benchmarks are easier?

Each of these questions is covered in the “Understand Benchmark Saturation and Contamination” learning path. Start with daily 5–8 minute micro-lessons that build from fundamentals to hands-on application.
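
As a taste of the FAQ item on detecting leaked eval items, here is a minimal sketch of the n-gram overlap probe named in Phase 2. It is an illustration under stated assumptions, not the path's own implementation: the function names `ngrams` and `overlap_ratio` are hypothetical, tokenization is plain lowercase whitespace splitting, and the default window of 8 tokens is an arbitrary choice. It also assumes you have some sample of the training corpus (or a proxy for it) to compare against.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(item: str, corpus_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the test item's n-grams that also occur somewhere in the corpus.

    Returns 0.0 when the item is shorter than n tokens, since it has no n-grams.
    The default n=8 is an assumption for illustration, not a recommended setting.
    """
    item_grams = ngrams(item, n)
    if not item_grams:
        return 0.0
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


if __name__ == "__main__":
    test_item = "the quick brown fox jumps over the lazy dog"
    corpus_sample = ["a quick brown fox jumps today", "unrelated text goes here"]
    # A short window (n=3) is used here only because the toy strings are short.
    print(overlap_ratio(test_item, corpus_sample, n=3))
```

A ratio near 1.0 suggests the item appears close to verbatim in the sample; a low ratio does not prove the item is clean, since paraphrased or reformatted leaks share few exact n-grams. That gap is why the outline pairs this probe with perplexity and canary checks.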