🧮 Understand Gradient Checkpointing
Stop guessing why gradient checkpointing tanks your throughput by 30% — learn to read the activation tape, pick the right granularity, and predict the compute overhead before you launch a single training run.
Phase 1: Why Activations Eat Your Memory
See why activations — not weights — eat your training memory
Weights aren't what fills your GPU; activations are (6 min)
Forward writes the tape; backward reads it in reverse (6 min)
Forget on purpose, recompute on demand (6 min)
Every N layers is a dial, not a switch (7 min)
Phase 2: Hand-Trace a Checkpointed Net
Hand-trace a 4-layer net and count peak activations
Count peak activations on a 4-layer net before any tricks (7 min)
Cut the saved set in half, do forward twice for half of it (7 min)
Count recompute by op, not by layer count (7 min)
Peak memory hits once; average matters for throughput (6 min)
Trace once for full, every-2, and every-layer on the same net (7 min)
Phase 3: Choosing the Right Policy
Pick full, selective, or offload — match policy to bottleneck
Your team enabled full checkpointing and lost 30% throughput (7 min)
Your 70B model needs more than checkpointing can give (7 min)
FlashAttention already recomputes attention; checkpointing twice is wasted work (7 min)
Pick the cheap-FLOPs, high-memory ops to checkpoint (8 min)
Phase 4: Prescribe a Real Checkpoint Policy
Diagnose a real OOM and prescribe a checkpoint policy
Take a real OOM and write the policy (18 min)
Frequently asked questions
- What is gradient checkpointing in plain terms?
- How much memory does gradient checkpointing actually save?
- Why does gradient checkpointing slow down training by 20-30%?
- When should I use selective checkpointing instead of full?
- How is gradient checkpointing different from activation offloading?

Each of these questions is covered in the “Understand Gradient Checkpointing” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.