
🧪 Understand Model Distillation

Stop treating model distillation as alchemy. Walk one teacher-student loop with a real loss function, then sketch a distillation plan to take one of your existing prompts to a smaller, cheaper model — by output, by reasoning trace, or by preference.

Applied · 14 drops · ~2-week path · 5–8 min/day · technology

Phase 1: Why Teacher Outputs Beat Raw Labels

See why teacher outputs beat raw labels for small models; a small numeric sketch of hard versus soft labels follows the drop list below.

4 drops
  1. Small models don't get smart from raw data — they get smart from teachers (6 min)
  2. Hard labels throw away most of what the teacher knows (6 min)
  3. A small model's job is to compress, not to discover (6 min)
  4. Distillation went mainstream because hosted teachers became cheap (6 min)

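A minimal numeric sketch of what drop 2 means by soft targets, using made-up teacher logits for a four-class problem: the hard label keeps only the argmax, while the teacher's temperature-softened distribution also says which wrong classes were nearly right.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; higher temperature flattens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up teacher logits for one example over four classes.
teacher_logits = [1.2, 3.9, 0.3, 2.8]

hard_label = [0, 1, 0, 0]                 # argmax only: "class 1", nothing else survives
soft_t1 = softmax(teacher_logits)         # ~[0.05, 0.70, 0.02, 0.23] at T=1
soft_t4 = softmax(teacher_logits, 4.0)    # ~[0.19, 0.37, 0.15, 0.28] at T=4

# The hard label cannot say that class 3 is a near miss while class 2 is not;
# the softened teacher distribution can, and that is what the student learns from.
```
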
Phase 2: One Student-Teacher Step, In Detail

Walk one student-teacher step with logits, temperature, and KL loss; a code sketch of that step follows the drop list below.

5 drops
  1. Distillation is a single loss with two terms (6 min)
  2. Temperature is what makes 'dark knowledge' visible (7 min)
  3. KL divergence is the loss for 'match my distribution' (6 min)
  4. Walk one batch through the loop, end to end (7 min)
  5. Three things that make a distillation run fail silently (7 min)

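As a reference point for this phase, here is a minimal sketch of one student-teacher step, assuming a PyTorch-style classification setup and the classic two-term loss: temperature-softened KL against the teacher plus ordinary cross-entropy on the hard labels. The model, optimizer, and batch names are placeholders, and the alpha/temperature values are common starting points rather than recommendations from the path.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, optimizer, inputs, labels,
                      temperature=2.0, alpha=0.5):
    """One step of classic logit distillation (a sketch, not a tuned recipe).

    loss = alpha * T^2 * KL(teacher_soft || student_soft)
         + (1 - alpha) * CE(student_logits, labels)
    """
    with torch.no_grad():                 # the teacher is frozen; only the student trains
        teacher_logits = teacher(inputs)
    student_logits = student(inputs)

    # Soften both distributions with the same temperature T.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL term: F.kl_div expects the student's log-probabilities and the teacher's
    # probabilities; the T^2 factor keeps its gradients on the same scale as CE.
    kd_loss = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2

    # Hard-label term at T = 1.
    ce_loss = F.cross_entropy(student_logits, labels)

    loss = alpha * kd_loss + (1 - alpha) * ce_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The temperature, the direction of the KL term, and the T² scaling are exactly the kind of knobs that can go quietly wrong, which is why the loop is worth walking once by hand.
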
Phase 3: Outputs, Traces, Preferences — Three Distillations

Compare output, trace, and preference distillation across regimes; a sketch of the offline data-collection step follows the drop list below.

4 drops
  1. Scenario — distilling outputs for a customer-support classifier (7 min)
  2. Scenario — distilling reasoning traces for a math tutor (7 min)
  3. Scenario — distilling preferences for a writing assistant (7 min)
  4. Scenario — choosing between offline and online distillation (7 min)

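For the customer-support and offline-vs-online scenarios in this phase, the offline variant typically reduces to: query the teacher once over a fixed prompt set, freeze its outputs to disk, and fine-tune the student on that file. A rough sketch; `call_teacher` is a placeholder for whatever hosted teacher you use, and the JSONL schema is just one common convention.

```python
import json

def call_teacher(prompt: str) -> str:
    """Placeholder for a call to the hosted teacher model (fill in your own API client)."""
    raise NotImplementedError

def build_offline_distillation_set(prompts, out_path="distill_train.jsonl"):
    """Offline distillation data: label every prompt with the teacher once, then freeze the file.

    The student is later fine-tuned on these (prompt, teacher output) pairs;
    the teacher is never called again while the student trains.
    """
    with open(out_path, "w", encoding="utf-8") as f:
        for prompt in prompts:
            # Teacher output might be an intent label, a reasoning trace, or a preference ranking.
            teacher_output = call_teacher(prompt)
            f.write(json.dumps({"prompt": prompt, "completion": teacher_output}) + "\n")

# Online distillation would instead query the teacher (or its logits) during
# student training, trading simplicity and cost for fresher targets.
```
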
Phase 4: Distill One of Your Real Prompts

Sketch a distillation plan for one of your real prompts

1 drop
  1. Write the distillation plan for one of your real prompts (25 min)

Frequently asked questions

What is model distillation in machine learning?
This is covered in the “Understand Model Distillation” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
Why does a small model trained on a big model's outputs beat one trained on raw labels?
This is covered in the “Understand Model Distillation” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
What's the difference between offline and online distillation?
This is covered in the “Understand Model Distillation” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
How are reasoning traces distilled into smaller models like DeepSeek-R1-Distill?
This is covered in the “Understand Model Distillation” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
Can I distill a Claude or GPT prompt into a smaller open model legally?
This is covered in the “Understand Model Distillation” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.