📉 Understand Gradient Descent
Stop treating the optimizer as a black box — walk a 2D loss surface by hand, feel why a learning rate that's too big diverges and one that's too small stalls, and learn to read SGD, momentum, and Adam loss curves the way a doctor reads a chart.
Phase 1: Why Models Walk Downhill
See gradients as steepest-descent arrows on a real surface
Every model trains by walking downhill on a loss surface (6 min)
The gradient points uphill — you walk the other way (6 min)
The learning rate is your step size, not your speed (6 min)
One equation runs every neural network on Earth (7 min)
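For a preview of where Phase 1 lands, this is the update rule that last lesson refers to, written for parameters θ, loss L, and learning rate η:

$$\theta_{t+1} = \theta_t - \eta \, \nabla L(\theta_t)$$

Every optimizer in Phase 3 is a variation on this one line: change how the gradient is estimated, or how the step is scaled, and you get SGD, momentum, or Adam.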
Phase 2: Stepping Across a 2D Bowl by Hand
Step across a 2D bowl by hand and plot the path
A 2D bowl is the smallest model that teaches everything (6 min)
Take ten steps and watch the path bend toward zero (7 min)
Set η = 0.6 and watch the bowl spit you out (7 min)
Set η = 0.001 and watch the bowl never end (6 min)
Three runs, three curves, one equation (6 min)
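To preview what Phase 2 has you do by hand, here is a minimal Python sketch of the same exercise. The bowl L(w1, w2) = w1² + 4·w2², the starting point, and the learning rates are illustrative assumptions, not necessarily the surface the lessons use, but they reproduce the three behaviours the titles promise.

```python
# Ten gradient-descent steps on the bowl L(w1, w2) = w1**2 + 4 * w2**2.
# The surface and starting point are illustrative, not the course's exact ones.

def loss(w1, w2):
    return w1**2 + 4 * w2**2

def grad(w1, w2):
    # Partial derivatives of the bowl: dL/dw1 = 2*w1, dL/dw2 = 8*w2.
    return 2 * w1, 8 * w2

for lr in (0.1, 0.6, 0.001):          # healthy, too big, too small
    w1, w2 = 3.0, 2.0                  # arbitrary starting point on the rim
    print(f"\nlearning rate = {lr}")
    for step in range(10):
        g1, g2 = grad(w1, w2)
        w1 -= lr * g1                  # step against the gradient
        w2 -= lr * g2
        print(f"  step {step + 1:2d}: w = ({w1:9.3f}, {w2:9.3f})  loss = {loss(w1, w2):12.3f}")
```

Run it and the η = 0.1 path bends steadily toward (0, 0), the η = 0.6 path gets thrown out of the bowl along the steep w2 direction, and the η = 0.001 path has barely moved after ten steps.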
Phase 3: What SGD, Momentum, and Adam Each Fix
Compare what SGD, momentum, and Adam each fix
You don't have time to compute the full gradient (6 min)
Your model needs memory of where it was going (6 min)
Different parameters need different learning rates (7 min)
The optimizer you pick is the one that fixes your bottleneck (7 min)
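As a preview of Phase 3, here is a sketch of the three update rules on a single parameter, assuming their common textbook forms; the hyperparameter defaults are typical values, and the lessons may present equivalent variants.

```python
import math

# One update step for each optimizer, acting on a single parameter w
# with gradient g. Hyperparameter values are typical defaults, not prescriptions.

def sgd_step(w, g, lr=0.01):
    # Plain SGD: step straight against the (mini-batch) gradient.
    return w - lr * g

def momentum_step(w, g, v, lr=0.01, mu=0.9):
    # Momentum: keep a running velocity so recent step directions persist.
    v = mu * v - lr * g
    return w + v, v

def adam_step(w, g, m, s, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # Adam: track the running mean (m) and uncentered variance (s) of gradients,
    # correct their startup bias, and scale each parameter's step by 1/sqrt(s).
    m = b1 * m + (1 - b1) * g
    s = b2 * s + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)          # bias correction; t counts from 1
    s_hat = s / (1 - b2 ** t)
    w = w - lr * m_hat / (math.sqrt(s_hat) + eps)
    return w, m, s

# Example: one step of each on the same parameter and gradient.
w, g = 3.0, 2.0
print(sgd_step(w, g))
print(momentum_step(w, g, v=0.0))
print(adam_step(w, g, m=0.0, s=0.0, t=1))
```

SGD fixes the cost of computing the full gradient, momentum fixes the zig-zagging of noisy steps, and Adam fixes the fact that different parameters want different step sizes, which is the bottleneck framing of the phase's last lesson.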
Phase 4: Diagnose Three Real Loss Curves
Diagnose three real loss curves like an ML engineer
Read three loss curves and prescribe the fix (8 min)
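To give a flavor of the final exercise, here is a crude triage function of my own (illustrative thresholds, not the course's rubric) that looks at a recorded loss history and names the likely problem.

```python
def diagnose(losses):
    """Crude triage of a training loss history (illustrative thresholds only)."""
    start, end = losses[0], losses[-1]
    recent = losses[len(losses) // 2:]
    if end > start or max(recent) > 10 * start:
        return "diverging: learning rate is likely too high"
    if end > 0.9 * start:
        return "stalled: learning rate may be too low, or the optimizer is stuck on a plateau"
    return "healthy: loss is decreasing, keep going"

# Example curves shaped like the three runs from Phase 2:
print(diagnose([25, 40, 90, 300, 1200]))        # exploding
print(diagnose([25, 24.7, 24.5, 24.2, 24.0]))   # crawling
print(diagnose([25, 12, 6, 3, 1.5]))            # converging
```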
Frequently asked questions
- What is gradient descent in plain English?
- Gradient descent is how a model learns: it checks which direction makes the loss increase (the gradient), steps a small amount the other way, and repeats until the loss stops shrinking. Phase 1 of this path builds that picture on a real loss surface.
- Why does a learning rate that's too high cause loss to explode?
- Each step moves the weights by the learning rate times the gradient. If that step overshoots the bottom of the valley, the new gradient is even steeper, the next step is even bigger, and the loss ratchets upward instead of down. Phase 2 has you trigger exactly this with η = 0.6.
- What's the difference between SGD, momentum, and Adam?
- SGD estimates the gradient from a small batch instead of the whole dataset, momentum keeps a running velocity so the path stops zig-zagging across narrow valleys, and Adam also gives each parameter its own effective step size. Phase 3 compares what each one fixes.
- How do I tell from a loss curve that my learning rate is wrong?
- A loss that climbs or oscillates wildly usually means the learning rate is too high, while a loss that creeps down in a nearly flat line means it is too low. The final phase walks you through three real curves and the fix for each.
- Why does the loss sometimes plateau and then drop again?
- A plateau often means the optimizer is crawling across a flat stretch or saddle of the loss surface; once momentum or a learning-rate change carries it past that region, the gradient picks up again and the loss resumes falling.
Related paths
🔵 Learn Set Theory Basics: The Language Every Math Class Assumes
Stop squinting at ∪, ∩, and ∁ — shade Venn diagrams first, then translate them into clean notation, until you can model anything from music genres to probability events as sets you can actually picture.
🐍 Python Decorators Introduction
Build one mental model for Python decorators that covers closures, argument passing, functools.wraps, and stacking — then ship a working caching or logging decorator from scratch in under 30 lines.
🦀 Rust Lifetimes Explained
Stop reading `'a` as line noise and start reading it as scope arithmetic — one failing snippet at a time — until you can thread lifetimes through a small parser or iterator adapter without fighting the borrow checker.
📈 Big O Intuition
Stop treating Big O as math you memorized for an interview — build the intuition to spot O(n²) disasters, pick the right data structure without thinking, and rewrite a slow function from O(n²) to O(n) in under five minutes.