🧪 Understand Model Distillation
Stop treating model distillation as alchemy. Walk one teacher-student loop with a real loss function, then sketch a distillation plan to take one of your existing prompts to a smaller, cheaper model — by output, by reasoning trace, or by preference.
Phase 1: Why Teacher Outputs Beat Raw Labels
See why teacher outputs beat raw labels for small models
Small models don't get smart from raw data — they get smart from teachers (6 min)
Hard labels throw away most of what the teacher knows (6 min)
A small model's job is to compress, not to discover (6 min)
Distillation went mainstream because hosted teachers became cheap (6 min)
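The contrast Phase 1 keeps returning to is easy to see in numbers: a hard label is a one-hot vector, while the teacher's softened output distribution also ranks the wrong answers. A minimal sketch in Python; the ticket-routing classes and probabilities are invented purely for illustration:

```python
import numpy as np

# Invented 4-way ticket-routing example: classes and numbers are illustrative only.
classes = ["billing", "refund", "shipping", "bug_report"]

# What raw labels give the student: "refund", and nothing else.
hard_label = np.array([0.0, 1.0, 0.0, 0.0])

# What a teacher's softened distribution might look like for the same ticket:
# "refund" is still the answer, but "billing" is flagged as a near-miss while
# "bug_report" is ruled out. That ranking of wrong answers is the "dark
# knowledge" hard labels throw away.
teacher_soft = np.array([0.27, 0.62, 0.10, 0.01])

for c, h, t in zip(classes, hard_label, teacher_soft):
    print(f"{c:>10}  hard={h:.2f}  teacher={t:.2f}")
```

A student trained only on the one-hot column never learns that "billing" was a near-miss; a student trained on the teacher's column does.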
Phase 2: One Student-Teacher Step, In Detail
Walk one student-teacher step with logits, temperature, and KL loss
Distillation is a single loss with two terms (6 min)
Temperature is what makes 'dark knowledge' visible (7 min)
KL divergence is the loss for 'match my distribution' (6 min)
Walk one batch through the loop, end to end (7 min)
Three things that make a distillation run fail silently (7 min)
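Phase 2's single loss with two terms is short enough to write out. A minimal PyTorch sketch in the usual Hinton-style formulation; the temperature, the weighting, and the names in the usage snippet are illustrative placeholders, not a prescribed recipe:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature=2.0, alpha=0.5):
    """Two-term distillation loss: soft-target KL plus hard-label cross-entropy.

    student_logits, teacher_logits: [batch, num_classes] tensors
    hard_labels: [batch] tensor of integer class ids
    temperature, alpha: illustrative defaults, not recommendations
    """
    # Soften both distributions with the same temperature so that near-miss
    # classes keep visible probability mass ("dark knowledge").
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)

    # Term 1: "match my distribution": KL divergence from teacher to student,
    # scaled by T^2 so its gradients stay on the same scale as the hard term.
    soft_loss = F.kl_div(student_log_probs, teacher_probs,
                         reduction="batchmean") * temperature ** 2

    # Term 2: ordinary cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, hard_labels)

    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

One batch through the loop then looks like this (here `teacher`, `student`, `optimizer`, `inputs`, and `labels` are placeholders for your own models, optimizer, and data):

```python
with torch.no_grad():                      # teacher is frozen, no gradients
    teacher_logits = teacher(inputs)
student_logits = student(inputs)
loss = distillation_loss(student_logits, teacher_logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Dividing both sets of logits by the temperature before the softmax is what keeps near-miss classes visible, and the T² factor keeps the soft-target gradients comparable in scale to the hard-label term.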
Phase 3: Outputs, Traces, Preferences — Three Distillations
Compare output, trace, and preference distillation across regimes
Scenario — distilling outputs for a customer-support classifier (7 min)
Scenario — distilling reasoning traces for a math tutor (7 min)
Scenario — distilling preferences for a writing assistant (7 min)
Scenario — choosing between offline and online distillation (7 min)
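To make these scenarios concrete, the offline, output-level variant usually reduces to this: query the teacher once per example, freeze the results into a file, and fine-tune the student on it. A minimal sketch; `call_teacher`, the routing prompt, and the file layout are stand-ins for your own client and data, not a real API:

```python
import json

def call_teacher(prompt: str) -> str:
    """Stand-in for your hosted teacher call; wire in your own client here."""
    raise NotImplementedError

# Hypothetical routing prompt for the customer-support scenario above.
ROUTING_PROMPT = (
    "Classify this support ticket as one of: billing, refund, shipping, bug_report.\n"
    "Ticket: {ticket}\n"
    "Label:"
)

def build_offline_dataset(tickets, out_path="distill_train.jsonl"):
    """Offline output distillation: query the teacher once per example and
    freeze the results into a fine-tuning file for the student."""
    with open(out_path, "w") as f:
        for ticket in tickets:
            label = call_teacher(ROUTING_PROMPT.format(ticket=ticket)).strip()
            f.write(json.dumps({"input": ticket, "target": label}) + "\n")
```

Online distillation would instead call the teacher inside the student's training loop to score or correct the student's own outputs, which is one reason the choice between the two regimes often comes down to teacher-call budget.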
Phase 4: Distill One of Your Real Prompts
Sketch a distillation plan for one of your real prompts
Write the distillation plan for one of your real prompts (25 min)
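If it helps to see the shape of the deliverable before starting, here is one possible skeleton for that plan, built from the choices the earlier phases introduce; every field and value below is a placeholder to replace with your own prompt, models, and budgets:

```python
# Illustrative skeleton only; every value is a placeholder for your own choices.
distillation_plan = {
    "prompt": "the production prompt you want to move off the big model",
    "teacher": "the hosted model currently serving that prompt",
    "student": "the smaller, cheaper model you want to serve instead",
    "what_to_distill": "outputs, reasoning traces, or preferences",
    "regime": "offline (fixed teacher dataset) vs. online (teacher in the loop)",
    "data_budget": "how many teacher calls you can spend generating training data",
    "evaluation": "the held-out set and metric that decide if the student ships",
    "failure_checks": "the silent-failure checks from Phase 2 you will monitor",
}
```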
Frequently asked questions
- What is model distillation in machine learning?
- Model distillation trains a small "student" model to reproduce the behavior of a larger "teacher" model, typically by imitating its output distributions, its reasoning traces, or its preferences, so you keep most of the quality at a fraction of the inference cost. This path walks one teacher-student loop in detail and ends with a distillation plan for one of your own prompts.
- Why does a small model trained on a big model's outputs beat one trained on raw labels?
- A hard label only says which answer is correct; the teacher's output distribution also says how plausible every alternative was (the "dark knowledge" in its soft targets), which is a much richer signal for a small model whose job is to compress rather than discover. Phases 1 and 2 of this path build that argument step by step.
- What's the difference between offline and online distillation?
- Offline distillation generates a fixed dataset of teacher outputs once and fine-tunes the student on it; online distillation keeps the teacher in the loop during training, scoring or guiding the student's own outputs. Phase 3 closes with a scenario for choosing between them.
- How are reasoning traces distilled into smaller models like DeepSeek-R1-Distill?
- Typically the teacher generates step-by-step reasoning traces for a set of problems, the traces are filtered for quality, and the student is fine-tuned to reproduce both the reasoning and the final answer. Phase 3's math-tutor scenario walks through this recipe.
- Can I distill a Claude or GPT prompt into a smaller open model legally?
- That depends on the provider's terms of service rather than on the technique: many hosted-model terms restrict using outputs to train competing models, so review the current terms for your provider and your use case before committing to a distillation pipeline. The path covers this consideration before you write your own plan.
Related paths
🐍 Python Decorators Introduction
Build one mental model for Python decorators that covers closures, argument passing, functools.wraps, and stacking — then ship a working caching or logging decorator from scratch in under 30 lines.
🦀 Rust Lifetimes Explained
Stop reading `'a` as line noise and start reading it as scope arithmetic — one failing snippet at a time — until you can thread lifetimes through a small parser or iterator adapter without fighting the borrow checker.
☸️ Kubernetes Core Concepts
Stop drowning in 30+ resource types. Build the mental model one primitive at a time -- pods, deployments, services, ingress, config -- then deploy a real app with rolling updates and health checks.
📈 Big O Intuition
Stop treating Big O as math you memorized for an interview — build the intuition to spot O(n²) disasters, pick the right data structure without thinking, and rewrite a slow function from O(n²) to O(n) in under five minutes.