
🧪 Understand Model Distillation

Stop treating model distillation as alchemy. Walk one teacher-student loop with a real loss function, then sketch a distillation plan to take one of your existing prompts to a smaller, cheaper model — by output, by reasoning trace, or by preference.

Applied · 14 drops · ~2-week path · 5–8 min/day · technology

Phase 1: Why Teacher Outputs Beat Raw Labels

See why teacher outputs beat raw labels for small models; a small numeric sketch of hard versus soft labels follows the drop list below.

4 drops
  1. Small models don't get smart from raw data — they get smart from teachers (6 min)
  2. Hard labels throw away most of what the teacher knows (6 min)
  3. A small model's job is to compress, not to discover (6 min)
  4. Distillation went mainstream because hosted teachers became cheap (6 min)

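A minimal numeric sketch of what drop 2 means by soft targets, using made-up teacher logits for a four-class problem: the hard label keeps only the argmax, while the teacher's temperature-softened distribution also says which wrong classes were nearly right.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; higher temperature flattens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up teacher logits for one example over four classes.
teacher_logits = [1.2, 3.9, 0.3, 2.8]

hard_label = [0, 1, 0, 0]                 # argmax only: "class 1", nothing else survives
soft_t1 = softmax(teacher_logits)         # ~[0.05, 0.70, 0.02, 0.23] at T=1
soft_t4 = softmax(teacher_logits, 4.0)    # ~[0.19, 0.37, 0.15, 0.28] at T=4

# The hard label cannot say that class 3 is a near miss while class 2 is not;
# the softened teacher distribution can, and that is what the student learns from.
```
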
Phase 2: One Student-Teacher Step, In Detail

Walk one student-teacher step with logits, temperature, and KL loss; a code sketch of that step follows the drop list below.

5 drops
  1. Distillation is a single loss with two terms (6 min)
  2. Temperature is what makes 'dark knowledge' visible (7 min)
  3. KL divergence is the loss for 'match my distribution' (6 min)
  4. Walk one batch through the loop, end to end (7 min)
  5. Three things that make a distillation run fail silently (7 min)

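As a reference point for this phase, here is a minimal sketch of one student-teacher step, assuming a PyTorch-style classification setup and the classic two-term loss: temperature-softened KL against the teacher plus ordinary cross-entropy on the hard labels. The model, optimizer, and batch names are placeholders, and the alpha/temperature values are common starting points rather than recommendations from the path.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, optimizer, inputs, labels,
                      temperature=2.0, alpha=0.5):
    """One step of classic logit distillation (a sketch, not a tuned recipe).

    loss = alpha * T^2 * KL(teacher_soft || student_soft)
         + (1 - alpha) * CE(student_logits, labels)
    """
    with torch.no_grad():                 # the teacher is frozen; only the student trains
        teacher_logits = teacher(inputs)
    student_logits = student(inputs)

    # Soften both distributions with the same temperature T.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL term: F.kl_div expects the student's log-probabilities and the teacher's
    # probabilities; the T^2 factor keeps its gradients on the same scale as CE.
    kd_loss = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2

    # Hard-label term at T = 1.
    ce_loss = F.cross_entropy(student_logits, labels)

    loss = alpha * kd_loss + (1 - alpha) * ce_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The temperature, the direction of the KL term, and the T² scaling are exactly the kind of knobs that can go quietly wrong, which is why the loop is worth walking once by hand.
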
Phase 3: Outputs, Traces, Preferences — Three Distillations

Compare output, trace, and preference distillation across regimes; a sketch of the offline data-collection step follows the drop list below.

4 drops
  1. Scenario — distilling outputs for a customer-support classifier (7 min)
  2. Scenario — distilling reasoning traces for a math tutor (7 min)
  3. Scenario — distilling preferences for a writing assistant (7 min)
  4. Scenario — choosing between offline and online distillation (7 min)

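For the customer-support and offline-vs-online scenarios in this phase, the offline variant typically reduces to: query the teacher once over a fixed prompt set, freeze its outputs to disk, and fine-tune the student on that file. A rough sketch; `call_teacher` is a placeholder for whatever hosted teacher you use, and the JSONL schema is just one common convention.

```python
import json

def call_teacher(prompt: str) -> str:
    """Placeholder for a call to the hosted teacher model (fill in your own API client)."""
    raise NotImplementedError

def build_offline_distillation_set(prompts, out_path="distill_train.jsonl"):
    """Offline distillation data: label every prompt with the teacher once, then freeze the file.

    The student is later fine-tuned on these (prompt, teacher output) pairs;
    the teacher is never called again while the student trains.
    """
    with open(out_path, "w", encoding="utf-8") as f:
        for prompt in prompts:
            # Teacher output might be an intent label, a reasoning trace, or a preference ranking.
            teacher_output = call_teacher(prompt)
            f.write(json.dumps({"prompt": prompt, "completion": teacher_output}) + "\n")

# Online distillation would instead query the teacher (or its logits) during
# student training, trading simplicity and cost for fresher targets.
```
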
Phase 4: Distill One of Your Real Prompts

Sketch a distillation plan for one of your real prompts

1 drop
  1. Write the distillation plan for one of your real prompts (25 min)

Frequently asked questions

What is model distillation in machine learning?
This is covered in the “Understand Model Distillation” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
Why does a small model trained on a big model's outputs beat one trained on raw labels?
This is covered in the “Understand Model Distillation” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
What's the difference between offline and online distillation?
This is covered in the “Understand Model Distillation” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
How are reasoning traces distilled into smaller models like DeepSeek-R1-Distill?
This is covered in the “Understand Model Distillation” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
Can I distill a Claude or GPT prompt into a smaller open model legally?
This is covered in the “Understand Model Distillation” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.