Question 1

What is the Chinchilla scaling law in simple terms?

Accepted Answer

This is covered in the "Understand Chinchilla Scaling Laws and Compute-Optimal Training" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.

Question 2

Why is the magic ratio 20 tokens per parameter?

Accepted Answer

This is covered in the "Understand Chinchilla Scaling Laws and Compute-Optimal Training" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
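
As a hedged sketch of where a fixed tokens-per-parameter ratio can come from, assume the three-term parametric loss commonly used in this setting, with the compute constraint C ≈ 6ND. The symbols E, A, B, α, β are assumptions introduced here for illustration, not values taken from this page.

```latex
% Minimize the loss under a fixed compute budget C = 6ND.
\begin{aligned}
L(N, D) &= E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad \text{subject to } 6ND = C \\
\mathcal{L}(N, D, \lambda) &= E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} + \lambda\,(6ND - C) \\
\frac{\partial \mathcal{L}}{\partial N} = 0 &\;\Rightarrow\; \frac{\alpha A}{N^{\alpha}} = 6\lambda N D \\
\frac{\partial \mathcal{L}}{\partial D} = 0 &\;\Rightarrow\; \frac{\beta B}{D^{\beta}} = 6\lambda N D \\
&\;\Rightarrow\; \frac{\alpha A}{N^{\alpha}} = \frac{\beta B}{D^{\beta}} \\
\text{if } \alpha = \beta:\quad \frac{D}{N} &= \left(\frac{B}{A}\right)^{1/\alpha}
\quad \text{(independent of } C\text{)}
\end{aligned}
```

The multiplier λ cancels, so when the two exponents are roughly equal the optimal D/N ratio does not depend on the budget. The specific value of about 20 is an empirical result of the fitted constants, not something the algebra alone produces.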

Question 3

If Chinchilla is optimal, why does LLaMA-3 train past it?

Accepted Answer

This is covered in the "Understand Chinchilla Scaling Laws and Compute-Optimal Training" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
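
For a sense of scale, here is a minimal sketch that compares a model's tokens per parameter with the roughly 20:1 rule of thumb. The parameter and token counts are rounded public figures (7B/2T and 8B/15T appear in the path outline; 70B/1.4T is Chinchilla itself), and the function name is hypothetical.

```python
# Rule-of-thumb ratio; an approximation, not an exact constant.
CHINCHILLA_TOKENS_PER_PARAM = 20.0


def overtraining_factor(n_params, n_tokens, ratio=CHINCHILLA_TOKENS_PER_PARAM):
    """Return (tokens per parameter, multiple of the rule-of-thumb token count)."""
    tokens_per_param = n_tokens / n_params
    return tokens_per_param, tokens_per_param / ratio


# Rounded (name, parameters, training tokens) figures.
models = [
    ("Chinchilla-70B", 70e9, 1.4e12),
    ("LLaMA-2-7B", 7e9, 2e12),
    ("LLaMA-3-8B", 8e9, 15e12),
]

for name, n, d in models:
    tpp, factor = overtraining_factor(n, d)
    print(f"{name}: {tpp:,.0f} tokens/param, {factor:.1f}x the 20:1 rule of thumb")
```

Under these rounded figures, LLaMA-2-7B lands at roughly 14× and LLaMA-3-8B at roughly 94× the rule-of-thumb token count.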

Question 4

What does compute-optimal actually optimize for?

Accepted Answer

This is covered in the "Understand Chinchilla Scaling Laws and Compute-Optimal Training" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.

Question 5

How do you pick N and D for a fixed training budget?

Accepted Answer

This is covered in the "Understand Chinchilla Scaling Laws and Compute-Optimal Training" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
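
The arithmetic behind this question can be sketched in a few lines, assuming the approximation C ≈ 6ND and the roughly 20 tokens-per-parameter rule of thumb named in the path outline. The function and constant names are hypothetical, and the result is a starting point rather than a definitive recipe.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0  # from the approximation C ≈ 6 * N * D
TOKENS_PER_PARAM = 20.0      # rule-of-thumb ratio (assumption)


def compute_optimal_split(compute_flops, tokens_per_param=TOKENS_PER_PARAM):
    """Split a FLOP budget into (parameters N, tokens D) at a fixed D/N ratio."""
    # Substituting D = r * N into C = 6 * N * D gives N = sqrt(C / (6 * r)).
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


for budget in (1e22, 1e23):
    n, d = compute_optimal_split(budget)
    print(f"C={budget:.0e}: N ≈ {n / 1e9:.1f}B params, D ≈ {d / 1e9:.0f}B tokens")
```

For a 1e23 FLOP budget this gives about 28.9B parameters and about 577B tokens, in line with the ≈29B and 580B figures in the outline. Passing a larger `tokens_per_param` models a deliberately overtrained, smaller model for the same budget.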

📉 Understand Chinchilla Scaling Laws and Compute-Optimal Training

Phase 1: Why Loss Bends to a Power Law

Loss isn't random: on log-log axes it follows a straight line

C ≈ 6ND is the equation that rules every run

GPT-3 was massively undertrained — and nobody noticed for two years

Chinchilla used three different methods — and they all agreed

Phase 2: Reading One Row of the Chinchilla Table

Each row is a Lagrangian — fixed compute, free N and D

The loss formula has three terms — and one of them never goes away

Derive 20:1 with one Lagrangian and a clean cancellation

A 1e23 FLOP budget gives you ≈29B params on 580B tokens

Verify Chinchilla itself with two multiplications

Phase 3: Why Production Trains Past Optimal

Your CFO is staring at LLaMA-2-7B trained on 2T tokens. Why?

Find the inference-token threshold where overtraining pays off

LLaMA-3-8B was trained on 15T tokens. That's roughly 94× the 20:1 Chinchilla ratio.

Past Chinchilla, every extra token is a bet on data quality

Phase 4: Defend a 1e22 FLOP Budget Split

Write the (N, D) memo: 1e22 FLOPs, inference-heavy product

Frequently asked questions

Related paths

🐍 Python Decorators Introduction

🦀 Rust Lifetimes Explained

☸️ Kubernetes Core Concepts

📈 Big O Intuition