📉 Understand Chinchilla Scaling Laws and Compute-Optimal Training
Stop repeating '20 tokens per parameter' like a mantra and start picking N and D the way LLaMA-3's team does — by the end, you'll defend a compute budget split that ignores Chinchilla on purpose.
Phase 1: Why Loss Bends to a Power Law
See why loss bends to a clean power law in compute
Loss isn't random: it bends on a ruler (6 min)
C ≈ 6ND is the equation that rules every run (6 min)
GPT-3 was massively undertrained, and nobody noticed for two years (7 min)
Chinchilla used three different methods, and they all agreed (7 min)
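A minimal sketch of the C ≈ 6ND rule from this phase, applied to GPT-3's publicly reported size (175B parameters, roughly 300B training tokens). The function name is illustrative and not taken from the lessons. Under these figures the estimate comes to about 3.15e23 FLOPs at roughly 1.7 tokens per parameter, which is the sense in which GPT-3 looks undertrained next to a 20:1 ratio.

```python
# Sketch: the C ≈ 6ND rule of thumb for dense transformer training compute.
# The function name is an assumption made for illustration.

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training FLOPs: about 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


# GPT-3 as publicly reported: 175B parameters, ~300B training tokens.
gpt3_flops = training_flops(175e9, 300e9)
gpt3_tokens_per_param = 300e9 / 175e9

print(f"GPT-3 compute: {gpt3_flops:.2e} FLOPs")
print(f"GPT-3 tokens per parameter: {gpt3_tokens_per_param:.2f}")
```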
Phase 2: Reading One Row of the Chinchilla Table
Walk the Chinchilla table and derive the 20:1 ratio
Each row is a Lagrangian: fixed compute, free N and D (7 min)
The loss formula has three terms, and one of them never goes away (6 min)
Derive 20:1 with one Lagrangian and a clean cancellation (8 min)
A 1e23 FLOP budget gives you ≈29B params on 580B tokens (7 min)
Verify Chinchilla itself with two multiplications (6 min)
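The arithmetic behind the last two lessons can be sketched as follows, assuming C ≈ 6ND and a fixed ratio D = 20N, so that C = 120N² and N = sqrt(C / 120). The function name and the `tokens_per_param` parameter are illustrative. For 1e23 FLOPs this yields about 28.9B parameters and 577B tokens, matching the rounded ≈29B and 580B above; Chinchilla's reported 70B parameters on 1.4T tokens works out to exactly 20 tokens per parameter and about 5.88e23 FLOPs.

```python
import math

# Sketch: split a FLOP budget under C ≈ 6ND with a fixed tokens-per-parameter ratio.
# The function and parameter names are assumptions made for illustration.

def compute_optimal_split(flops: float, tokens_per_param: float = 20.0):
    """Return (N, D) such that 6 * N * D == flops and D == tokens_per_param * N."""
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# A 1e23 FLOP budget under the 20:1 rule of thumb.
n, d = compute_optimal_split(1e23)
print(f"N ≈ {n / 1e9:.1f}B params, D ≈ {d / 1e9:.0f}B tokens")

# Verify Chinchilla itself with two multiplications (one ratio, one product).
chinchilla_n, chinchilla_d = 70e9, 1.4e12
chinchilla_ratio = chinchilla_d / chinchilla_n
chinchilla_flops = 6.0 * chinchilla_n * chinchilla_d
print(f"Chinchilla: {chinchilla_ratio:.0f} tokens/param, {chinchilla_flops:.2e} FLOPs")
```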
Phase 3: Why Production Trains Past Optimal
Understand why LLaMA and Mistral train far past optimal
Your CFO is staring at LLaMA-2-7B trained on 2T tokens. Why? (7 min)
Find the inference-token threshold where overtraining pays off (8 min)
LLaMA-3-8B was trained on 15T tokens. That's roughly 94× the 20:1 Chinchilla token count. (7 min)
Past Chinchilla, every extra token is a bet on data quality (7 min)
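A hedged sketch of the two calculations this phase relies on. The first measures how far a run sits past the 20:1 token count. The second finds a break-even inference volume under two assumptions that are not stated in the outline: inference costs about 2N FLOPs per generated token, and the larger and smaller configurations being compared reach the same loss. Both function names are hypothetical, and the numbers in the break-even example are placeholders, not fitted values.

```python
# Sketch: overtraining multiple and inference break-even.
# Assumptions: training costs ~6ND FLOPs, inference costs ~2N FLOPs per token,
# and the two configurations being compared reach the same loss.

def overtraining_multiple(n_params: float, n_tokens: float,
                          tokens_per_param: float = 20.0) -> float:
    """How many times the 20:1 token count a run actually used."""
    return n_tokens / (tokens_per_param * n_params)


def break_even_inference_tokens(n_large: float, d_large: float,
                                n_small: float, d_small: float) -> float:
    """Inference tokens T where 6*N*D + 2*N*T is equal for both configurations.

    Beyond T, the smaller (overtrained) model has the lower lifetime cost.
    """
    extra_training = n_small * d_small - n_large * d_large
    if n_small >= n_large or extra_training <= 0:
        raise ValueError("expected a smaller model that costs more to train")
    return 3.0 * extra_training / (n_large - n_small)


llama2_7b = overtraining_multiple(7e9, 2e12)    # about 14x
llama3_8b = overtraining_multiple(8e9, 15e12)   # about 94x
print(f"LLaMA-2-7B: {llama2_7b:.1f}x, LLaMA-3-8B: {llama3_8b:.1f}x")

# Placeholder configurations, chosen only to exercise the formula.
threshold = break_even_inference_tokens(10e9, 200e9, 5e9, 500e9)
print(f"Break-even at {threshold:.1e} inference tokens")
```

The break-even formula follows from setting 6·N₁·D₁ + 2·N₁·T equal to 6·N₂·D₂ + 2·N₂·T and solving for T.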
Phase 4: Defend a 1e22 FLOP Budget Split
Split a 1e22 FLOP budget and defend your choice
Write the (N, D) memo: 1e22 FLOPs, inference-heavy product (25 min)
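One way to start the memo is to list candidate splits on the same 1e22 FLOP budget, assuming C ≈ 6ND so that D = C / (6N) once N is chosen. The 3B and 1B candidates below are hypothetical choices for an inference-heavy product, not recommendations from the lessons, and the sketch says nothing about which split reaches the lowest loss.

```python
import math

# Sketch: candidate (N, D) splits on a fixed budget, assuming C ≈ 6ND.
# Candidate model sizes other than the 20:1 point are hypothetical.

BUDGET_FLOPS = 1e22


def tokens_for_budget(flops: float, n_params: float) -> float:
    """Tokens affordable at a given model size under C ≈ 6ND."""
    return flops / (6.0 * n_params)


chinchilla_n = math.sqrt(BUDGET_FLOPS / (6.0 * 20.0))  # the 20:1 point
candidates = [chinchilla_n, 3e9, 1e9]

rows = []
for n_params in candidates:
    n_tokens = tokens_for_budget(BUDGET_FLOPS, n_params)
    rows.append((n_params, n_tokens, n_tokens / n_params))

for n_params, n_tokens, ratio in rows:
    print(f"N = {n_params / 1e9:5.2f}B  D = {n_tokens / 1e9:7.0f}B  "
          f"tokens/param = {ratio:7.0f}")
```

Smaller N buys more tokens and cheaper inference per token; whether the quality holds up at that ratio is the part the memo has to argue.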
Frequently asked questions
- What is the Chinchilla scaling law in simple terms?
- Why is the magic ratio 20 tokens per parameter?
- If Chinchilla is optimal, why does LLaMA-3 train past it?
- What does compute-optimal actually optimize for?
- How do you pick N and D for a fixed training budget?
Each of these questions is covered in the “Understand Chinchilla Scaling Laws and Compute-Optimal Training” learning path. Start with short daily micro-lessons (6 to 8 minutes each, plus a 25-minute capstone) that build from fundamentals to hands-on application.