📉 Understand Chinchilla Scaling Laws and Compute-Optimal Training

Stop repeating '20 tokens per parameter' like a mantra and start picking N and D the way LLaMA-3's team does — by the end, you'll defend a compute budget split that ignores Chinchilla on purpose.

Applied · 14 drops · ~2-week path · 5–8 min/day · technology

Phase 1: Why Loss Bends to a Power Law

See why loss bends to a clean power law in compute

4 drops
  1. Loss isn't random — it bends on a ruler

    6 min

  2. C ≈ 6ND is the equation that rules every run

    6 min

  3. GPT-3 was massively undertrained — and nobody noticed for two years

    7 min

  4. Chinchilla used three different methods — and they all agreed

    7 min
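The C ≈ 6ND rule and the "GPT-3 was undertrained" claim in this phase can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function and variable names are made up for this example, and the GPT-3 figures (175B parameters, about 300B training tokens) are the publicly reported ones.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute with the rule of thumb C ≈ 6 * N * D."""
    return 6.0 * n_params * n_tokens


# Publicly reported GPT-3 figures: 175B parameters, roughly 300B training tokens.
gpt3_params = 175e9
gpt3_tokens = 300e9

gpt3_flops = training_flops(gpt3_params, gpt3_tokens)
gpt3_tokens_per_param = gpt3_tokens / gpt3_params

print(f"GPT-3 training compute: {gpt3_flops:.2e} FLOPs")
print(f"GPT-3 tokens per parameter: {gpt3_tokens_per_param:.2f}")
```

At under 2 tokens per parameter, GPT-3 sits far below the 20:1 ratio this path derives later, which is the sense in which it was undertrained.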

Phase 2: Reading One Row of the Chinchilla Table

Walk the Chinchilla table and derive the 20:1 ratio

5 drops
  1. Each row is a Lagrangian — fixed compute, free N and D

    7 min

  2. The loss formula has three terms — and one of them never goes away

    6 min

  3. Derive 20:1 with one Lagrangian and a clean cancellation

    8 min

  4. A 1e23 FLOP budget gives you ≈29B params on 580B tokens

    7 min

  5. Verify Chinchilla itself with two multiplications

    6 min
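The 1e23 FLOP row and the Chinchilla check in this phase both reduce to one substitution: fix D = 20N in C ≈ 6ND, which gives N = sqrt(C / 120). The following is a minimal sketch under that assumption. The helper name `compute_optimal_split` is hypothetical, and the Chinchilla figures (70B parameters, 1.4T tokens) are the published ones.

```python
import math


def compute_optimal_split(flops: float, tokens_per_param: float = 20.0):
    """Solve C = 6 * N * D with D = r * N, which gives N = sqrt(C / (6 * r))."""
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Row from the outline: a 1e23 FLOP budget at 20 tokens per parameter.
n_opt, d_opt = compute_optimal_split(1e23)
print(f"N ≈ {n_opt / 1e9:.1f}B params, D ≈ {d_opt / 1e9:.0f}B tokens")

# The two multiplications for Chinchilla itself: 70B params on 1.4T tokens.
chinchilla_params = 70e9
chinchilla_tokens = 1.4e12
chinchilla_ratio = chinchilla_tokens / chinchilla_params
chinchilla_flops = 6.0 * chinchilla_params * chinchilla_tokens
print(f"Chinchilla: {chinchilla_ratio:.0f} tokens/param, {chinchilla_flops:.2e} FLOPs")
```

The split comes out near 29B parameters and 580B tokens, matching the row above, and Chinchilla lands on exactly 20 tokens per parameter.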

Phase 3: Why Production Trains Past Optimal

Understand why LLaMA and Mistral train far past optimal

4 drops
  1. Your CFO is staring at LLaMA-2-7B trained on 2T tokens. Why?

    7 min

  2. Find the inference-token threshold where overtraining pays off

    8 min

  3. LLaMA-3-8B was trained on 15T tokens. That's about 94× Chinchilla.

    7 min

  4. Past Chinchilla, every extra token is a bet on data quality

    7 min
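The overtraining figures and the inference-token threshold in this phase can be sketched numerically. This example rests on several assumptions: the 20:1 heuristic as the baseline, training cost of about 6ND FLOPs, inference cost of about 2N FLOPs per generated token, and a hypothetical pair of models that are simply assumed to reach equal quality. The function names and the 13B/260B comparison model are invented for illustration.

```python
CHINCHILLA_RATIO = 20.0  # tokens per parameter, the heuristic used as baseline


def overtraining_multiple(n_params: float, n_tokens: float) -> float:
    """How many times past the 20:1 token budget a model was trained."""
    return (n_tokens / n_params) / CHINCHILLA_RATIO


def breakeven_inference_tokens(small, large) -> float:
    """Inference tokens T at which lifetime FLOPs (6*N*D + 2*N*T) are equal.

    `small` and `large` are (N, D) tuples; equal quality is an assumption.
    """
    n_s, d_s = small
    n_l, d_l = large
    extra_training = 6.0 * (n_s * d_s - n_l * d_l)  # extra cost of overtraining
    saving_per_token = 2.0 * (n_l - n_s)            # cheaper inference per token
    return extra_training / saving_per_token


llama2_7b = overtraining_multiple(7e9, 2e12)
llama3_8b = overtraining_multiple(8e9, 15e12)
print(f"LLaMA-2-7B: {llama2_7b:.1f}x, LLaMA-3-8B: {llama3_8b:.1f}x the 20:1 budget")

# Hypothetical comparison: 7B on 2T tokens vs. an assumed-equivalent 13B on 260B.
threshold = breakeven_inference_tokens(small=(7e9, 2e12), large=(13e9, 260e9))
print(f"Overtraining pays off after about {threshold:.2e} inference tokens")
```

Below the threshold the larger, Chinchilla-style model is cheaper overall. Above it, the smaller overtrained model wins on lifetime compute, provided the equal-quality assumption actually holds.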

Phase 4: Defend a 1e22 FLOP Budget Split

Split a 1e22 FLOP budget and defend your choice

1 drop
  1. Write the (N, D) memo: 1e22 FLOPs, inference-heavy product

    25 min
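One way to start the memo is to tabulate a few candidate splits of the same 1e22 FLOP budget. The sketch below assumes C ≈ 6ND and about 2N FLOPs per inference token. The candidate ratios (20, 100, 500 tokens per parameter) are arbitrary illustrative choices, not recommendations.

```python
import math

BUDGET_FLOPS = 1e22


def split_for_ratio(flops: float, tokens_per_param: float):
    """Return (N, D) that spends `flops` at the given tokens-per-parameter ratio."""
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params


candidates = {}
for ratio in (20, 100, 500):  # illustrative ratios: compute-optimal to heavily overtrained
    n, d = split_for_ratio(BUDGET_FLOPS, ratio)
    candidates[ratio] = (n, d)
    print(
        f"{ratio:>4} tok/param: N = {n / 1e9:5.2f}B, D = {d / 1e9:7.1f}B, "
        f"inference ≈ {2 * n:.2e} FLOPs/token"
    )
```

Every row spends the same training compute. What changes is the per-token serving cost, which is the number an inference-heavy product cares about.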

Frequently asked questions

What is the Chinchilla scaling law in simple terms?
Why is the magic ratio 20 tokens per parameter?
If Chinchilla is optimal, why does LLaMA-3 train past it?
What does compute-optimal actually optimize for?
How do you pick N and D for a fixed training budget?
Each of these questions is covered in the “Understand Chinchilla Scaling Laws and Compute-Optimal Training” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.