🧊Understand ZeRO and Its Three Stages
Pencil-and-paper your way through ZeRO stages 1, 2, and 3 — sharding optimizer state, then gradients, then params — until you can pick a stage for a 13B model on 8 A100s and justify it from memory math, not vibes.
Phase 1: Where GPU Memory Actually Goes
See what eats GPU memory before FLOPs do
Your GPU runs out of memory long before it runs out of FLOPs (7 min)
Plain data-parallel replicates everything, even the optimizer (6 min)
ZeRO has three stages because there are three things to shard (7 min)
Memory savings aren't free: every stage adds communication (7 min)
Phase 2: Sharding Tensors One Stage at a Time
Shard one tensor at a time, counting comms
Shard one optimizer across four GPUs with a pencil (7 min)
Adding gradients to the sharded list costs you nothing (7 min)
Sharding parameters means every layer needs an all-gather (8 min)
ZeRO doesn't touch activations: that's a separate fight (7 min)
Comms time = bytes ÷ bandwidth, and you can predict it (8 min)
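The bytes ÷ bandwidth rule can be tried out in a few lines. This is a minimal sketch, not a profiler: the function name `comm_seconds` and the 100 GB/s bandwidth are hypothetical placeholders, and the volume factors (2× the parameter count for plain data parallelism and stages 1 and 2, 3× for stage 3) follow the analysis in the ZeRO paper. It assumes fp16 traffic and ignores latency and any overlap with compute.

```python
def comm_seconds(num_params, stage, bandwidth_bytes_per_s, bytes_per_elem=2):
    """Rough per-step communication time per GPU: bytes moved / bandwidth.

    Assumption: each GPU moves 2x the parameter count per step for
    stages 0-2 (reduce-scatter + all-gather) and 3x for stage 3
    (one extra all-gather of the parameters).
    """
    volume_factor = 3 if stage == 3 else 2
    bytes_moved = volume_factor * num_params * bytes_per_elem
    return bytes_moved / bandwidth_bytes_per_s


# Hypothetical inputs: 13B parameters, 100 GB/s effective bandwidth.
ASSUMED_BANDWIDTH = 100e9  # bytes per second, placeholder value
for stage in (0, 1, 2, 3):
    print(f"stage {stage}: {comm_seconds(13e9, stage, ASSUMED_BANDWIDTH):.2f} s")
```

With these placeholder numbers, stages 0 through 2 come out at about 0.52 s per step and stage 3 at about 0.78 s. Swap in the measured bandwidth of your own interconnect before drawing conclusions.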
Phase 3: ZeRO Across Frameworks and Tiers
Map ZeRO onto FSDP and ZeRO-Infinity
Your config says FullyShard but the doc points to ZeRO-3: which is it? (7 min)
Your CPU RAM and NVMe are just slower GPUs (7 min)
ZeRO inside a node, pipeline across, and other Frankenstein configs (8 min)
The decision tree fits on one napkin (8 min)
Phase 4: Pick a Stage and Defend It
Pick a stage for a 13B model on 8 A100s
Pick a stage for 13B on 8 A100s and write the memo (8 min)
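The memory math behind that memo can be sketched as follows. It assumes mixed-precision Adam with the accounting used in the ZeRO paper: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer state (master weights, momentum, variance). The function name `zero_model_state_gb` is hypothetical, and activations, buffers, and fragmentation are left out, so treat the results as a lower bound.

```python
def zero_model_state_gb(num_params, num_gpus, stage):
    """Per-GPU model-state memory in GB (1 GB = 1e9 bytes).

    Assumption: mixed-precision Adam, so 2 + 2 + 12 = 16 bytes per
    parameter before any sharding. Activations are not counted.
    """
    params = 2 * num_params   # fp16 parameters
    grads = 2 * num_params    # fp16 gradients
    optim = 12 * num_params   # fp32 master weights + momentum + variance
    if stage >= 1:
        optim /= num_gpus     # stage 1 shards optimizer state
    if stage >= 2:
        grads /= num_gpus     # stage 2 also shards gradients
    if stage >= 3:
        params /= num_gpus    # stage 3 also shards parameters
    return (params + grads + optim) / 1e9


for stage in (0, 1, 2, 3):
    print(f"stage {stage}: {zero_model_state_gb(13e9, 8, stage):.2f} GB per GPU")
```

Under these assumptions a 13B model on 8 GPUs needs about 208 GB per GPU with no sharding, 71.5 GB at stage 1, 48.75 GB at stage 2, and 26 GB at stage 3. Compare those figures against the memory of your A100 variant, and leave headroom for activations, before choosing.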
Frequently asked questions
- What are the three stages of ZeRO?
- What's the difference between ZeRO-1, ZeRO-2, and ZeRO-3?
- How does ZeRO compare to FSDP in PyTorch?
- Why does ZeRO-3 cost more communication than ZeRO-2?
- When should I use ZeRO-Infinity instead of ZeRO-3?

Each of these questions is covered in the “Understand ZeRO and Its Three Stages” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.