
📉 Understand Gradient Descent

Stop treating the optimizer as a black box — walk a 2D loss surface by hand, feel why a learning rate that's too big diverges and one that's too small stalls, and learn to read SGD, momentum, and Adam loss curves the way a doctor reads a chart.

Applied · 14 drops · ~2-week path · 5–8 min/day · technology · math

Phase 1: Why Models Walk Downhill

See gradients as steepest-descent arrows on a real surface

4 drops
  1. Every model trains by walking downhill on a loss surface

    6 min

  2. The gradient points uphill — you walk the other way

    6 min

  3. The learning rate is your step size, not your speed

    6 min

  4. One equation runs every neural network on Earth (written out just below this list)

    7 min
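The "one equation" in that final drop is presumably the vanilla gradient-descent update, which in standard notation (θ for the parameters, η for the learning rate, L for the loss) reads:

```latex
\theta_{t+1} \;=\; \theta_t \;-\; \eta \,\nabla L(\theta_t)
```

Everything in Phase 3 is a variation on this one line: SGD swaps in a minibatch estimate of ∇L(θ_t), momentum replaces the raw gradient with a running average of past ones, and Adam rescales the step per parameter.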

Phase 2: Stepping Across a 2D Bowl by Hand

Step across a 2D bowl by hand and plot the path

5 drops
  1. A 2D bowl is the smallest model that teaches everything

    6 min

  2. Take ten steps and watch the path bend toward zero

    7 min

  3. Set η = 0.6 and watch the bowl spit you out

    7 min

  4. Set η = 0.001 and watch the bowl never end

    6 min

  5. Three runs, three curves, one equation (a runnable sketch of all three follows this list)

    6 min
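Drop 5's three runs can be reproduced in a few lines. A minimal sketch, assuming a bowl of the form L(x, y) = ½(x² + 10y²) — the path doesn't publish its exact surface, so the steep 10y² axis is an assumption chosen so that η = 0.6 actually diverges:

```python
import numpy as np

# Assumed bowl: L(x, y) = 0.5 * (x^2 + 10*y^2), gradient (x, 10*y).
def gradient(p):
    x, y = p
    return np.array([x, 10.0 * y])

def descend(eta, steps=10, start=(3.0, 2.0)):
    p = np.array(start)
    path = [p.copy()]
    for _ in range(steps):
        p = p - eta * gradient(p)   # the one update rule from Phase 1
        path.append(p.copy())
    return np.array(path)

for eta in (0.05, 0.6, 0.001):      # sensible, too big, too small
    path = descend(eta)
    loss = 0.5 * (path[:, 0] ** 2 + 10.0 * path[:, 1] ** 2)
    print(f"eta={eta}: final loss {loss[-1]:.4g}")
```

Along the steep axis each step multiplies y by (1 − 10η), so η = 0.6 gives a factor of −5 per step and the iterate rockets out of the bowl, while η = 0.001 shrinks the loss so slowly that ten steps barely dent it.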

Phase 3: What SGD, Momentum, and Adam Each Fix

Compare what SGD, momentum, and Adam each fix

4 drops
  1. You don't have time to compute the full gradient

    6 min

  2. Your model needs memory of where it was going

    6 min

  3. Different parameters need different learning rates

    7 min

  4. The optimizer you pick is the one that fixes your bottleneck (the three update rules are sketched after this list)

    7 min
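The three fixes this phase names map onto three small modifications of the Phase 1 update. Below is a minimal sketch of the textbook forms; the hyperparameter defaults (β = 0.9, β₁ = 0.9, β₂ = 0.999, ε = 1e-8) are common conventions, not necessarily the ones the drops use:

```python
import numpy as np

def sgd_step(theta, grad, eta=0.01):
    # Plain (stochastic) GD: grad is typically a cheap minibatch estimate
    # of the full gradient rather than the exact thing.
    return theta - eta * grad

def momentum_step(theta, v, grad, eta=0.01, beta=0.9):
    # Momentum: a running velocity remembers where the path was going,
    # smoothing zig-zags and coasting through shallow stretches.
    v = beta * v + grad
    return theta - eta * v, v

def adam_step(theta, m, s, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam: track the running mean (m) and mean square (s) of gradients,
    # then give every parameter its own step size ~ eta / sqrt(s).
    m = beta1 * m + (1 - beta1) * grad
    s = beta2 * s + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)        # bias-correct the early steps
    s_hat = s / (1 - beta2 ** t)        # (t counts from 1)
    return theta - eta * m_hat / (np.sqrt(s_hat) + eps), m, s
```

Swap any of these into the Phase 2 loop in place of the plain update and the corresponding loss curve drops out.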

Phase 4: Diagnose Three Real Loss Curves

Diagnose three real loss curves like an ML engineer

1 drop
  1. Read three loss curves and prescribe the fix (a toy triage heuristic follows below)

    8 min
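The final drop's triage can be caricatured as a function. This is only an illustrative heuristic; the window and tolerance are made-up thresholds, not the path's rubric:

```python
def diagnose(losses, window=20, stall_tol=1e-3):
    """Rough triage of a training-loss curve (list of floats, oldest first).
    Thresholds here are illustrative assumptions, not an official rubric."""
    if len(losses) < 2 * window:
        return "not enough history yet"
    recent = losses[-window:]
    earlier = losses[-2 * window:-window]
    # Rising above the starting loss, or NaNs, reads as divergence.
    if any(l != l for l in recent) or recent[-1] > losses[0]:
        return "diverging: learning rate too high -- cut eta"
    # Near-zero relative improvement over a window reads as a stall.
    improvement = (sum(earlier) - sum(recent)) / max(sum(earlier), 1e-12)
    if improvement < stall_tol:
        return "stalled: eta too low or a plateau -- raise eta or add momentum"
    return "healthy: keep going"
```

It encodes the phase's rules of thumb: a rising or NaN loss means the learning rate is too high, and near-zero improvement over a window means it is too low or the optimizer is parked on a plateau.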

Frequently asked questions

What is gradient descent in plain English?
Gradient descent improves a model by repeatedly nudging its parameters in whichever direction lowers the loss, like walking downhill by always stepping the steepest way down. Phase 1 of this path builds that picture in four drops.
Why does a learning rate that's too high cause loss to explode?
Each step moves the parameters by the learning rate times the gradient. If the step overshoots the valley floor, you land on the far wall where the gradient is even larger, the next step overshoots by more, and the loss compounds upward. Phase 2 triggers this deliberately with η = 0.6.
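A one-line derivation makes the explosion concrete. On a 1D quadratic L(θ) = ½aθ² (a simplified stand-in for the path's 2D bowl), the update either contracts or explodes depending on η:

```latex
\theta_{t+1} = \theta_t - \eta\, a\,\theta_t = (1 - \eta a)\,\theta_t
\quad\Rightarrow\quad
|1 - \eta a| > 1 \;\text{ (i.e. } \eta > 2/a\text{)} \;\Rightarrow\; |\theta_t| \to \infty .
```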
What's the difference between SGD, momentum, and Adam?
SGD estimates the gradient from a small batch so each step is cheap; momentum adds a running velocity that smooths noise and carries you through flat stretches; Adam additionally gives every parameter its own adaptive step size. Phase 3 compares all three on the same problem.
How do I tell from a loss curve that my learning rate is wrong?
Too high shows up as a loss that oscillates, spikes, or climbs; too low shows up as a loss that falls in a nearly straight, painfully slow line. Phase 4 practices reading both patterns on real curves.
Why does the loss sometimes plateau and then drop again?
A plateau usually means the optimizer is crossing a flat region or saddle point where gradients are tiny; once it finds a downhill direction again, the loss falls. Momentum and per-parameter step sizes, covered in Phase 3, are the standard ways to shorten these plateaus.