🔬Understand bf16, fp16, and Loss Scaling

Stop flipping the precision flag and praying. You'll read a float as sign-exponent-mantissa, see exactly why fp16 NaNs and bf16 doesn't, and prescribe the right fix — loss scaling, bf16, or a mixed policy — for any training run.

Applied · 14 drops · ~2-week path · 5–8 min/day · technology

Phase 1 · Inside a Float: Sign, Exponent, Mantissa

Read a float's bits and see range versus precision

4 drops
  1. A float is three knobs, not one number (7 min)
  2. Exponent buys reach. Mantissa buys resolution. (7 min)
  3. fp16 has a basement at 6e-5 (6 min)
  4. fp16 has a ceiling at 65,504 (6 min)
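
As a minimal sketch of the bit-reading idea in this phase, the snippet below uses NumPy to split an fp16 value into its three fields. The helper name `fp16_fields` is made up for illustration and is not part of the course material. It assumes the standard IEEE 754 half-precision layout: 1 sign bit, 5 exponent bits (bias 15), and 10 mantissa bits.

```python
import numpy as np

def fp16_fields(x):
    """Round x to fp16 and return (sign, exponent_field, mantissa_field).

    Hypothetical helper for illustration only.
    """
    # Reinterpret the 16 bits of the fp16 value as an unsigned integer.
    bits = int(np.array([x], dtype=np.float16).view(np.uint16)[0])
    sign = bits >> 15                 # top bit
    exponent = (bits >> 10) & 0x1F    # next 5 bits, stored with a bias of 15
    mantissa = bits & 0x3FF           # low 10 bits
    return sign, exponent, mantissa

print(fp16_fields(1.0))      # (0, 15, 0): 2**(15 - 15) * (1 + 0/1024)
print(fp16_fields(65504.0))  # (0, 30, 1023): largest finite fp16 value

# The "ceiling" and the smallest normal value, as reported by NumPy.
fp16_max = float(np.finfo(np.float16).max)          # 65504.0
fp16_min_normal = float(np.finfo(np.float16).tiny)  # 2**-14, about 6.1e-5
print(fp16_max, fp16_min_normal)
```

The "basement" near 6e-5 is the smallest normal value, 2**-14. Subnormal values reach further down, to 2**-24 (about 6e-8), but with fewer significant bits.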

Phase 2 · Watching Gradients Survive or Disappear

Walk gradients through fp16, bf16, and loss scaling

5 drops
  1. A gradient that lives in fp32 dies in fp16 (7 min)
  2. Loss scaling is a range-shift, not a magic constant (7 min)
  3. Mixed precision keeps an fp32 master copy of every weight (7 min)
  4. bf16 needs no loss scaler at all (7 min)
  5. Pick which gradient lives in which precision (7 min)
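
The underflow and range-shift ideas in this phase can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the course's own code: the gradient value `1e-8`, the scale `2**16`, and the helper name `fp16_roundtrip` are all chosen for the example.

```python
import numpy as np

def fp16_roundtrip(grad, scale=1.0):
    """Scale a gradient in fp32, store it in fp16, then unscale in fp32.

    Hypothetical helper: mimics what loss scaling does to one gradient value.
    """
    scaled = np.float32(grad) * np.float32(scale)   # multiply while still in fp32
    stored = np.float16(scaled)                     # the fp16 storage step
    return float(np.float32(stored) / np.float32(scale))  # unscale in fp32

grad = 1e-8  # representable in fp32, below fp16's smallest subnormal (about 6e-8)

unscaled = fp16_roundtrip(grad)                   # flushed to zero in fp16
rescued = fp16_roundtrip(grad, scale=2.0 ** 16)   # shifted into fp16's normal range

print(unscaled)  # 0.0
print(rescued)   # close to 1e-8, limited by fp16's 10-bit mantissa
```

The scale moves the value up the exponent range before the fp16 cast and is divided back out in fp32, which is why the result depends on range rather than on any particular "magic" constant.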

Phase 3 · Why bf16 Took Over After the A100

Trace the post-A100 bf16 shift across real hardware

4 drops
  1. Your team trained in fp16 last year because their GPU couldn't do bf16 (7 min)
  2. Your inference service runs fp16. Should it switch? (7 min)
  3. A LayerNorm in bf16 makes your loss curve weird (7 min)
  4. Google picked bf16 in 2017. Why did the industry wait? (7 min)
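
To make the bf16 trade-off concrete, here is a rough NumPy emulation. NumPy has no native bf16 type, so the hypothetical helper `to_bf16_truncate` keeps only the top 16 bits of an fp32 value (1 sign bit, 8 exponent bits, 7 mantissa bits). Note the assumption: this truncates, whereas real hardware typically rounds to nearest, so low-order results can differ slightly.

```python
import numpy as np

def to_bf16_truncate(x):
    """Emulate bf16 by zeroing the low 16 bits of an fp32 value.

    Hypothetical helper; truncation is a simplification of hardware rounding.
    """
    a = np.array([x], dtype=np.float32)
    bits = a.view(np.uint32) & np.uint32(0xFFFF0000)  # keep sign, exponent, top 7 mantissa bits
    return float(bits.view(np.float32)[0])

# Range: a tiny gradient survives in bf16 but is flushed to zero in fp16.
tiny_bf16 = to_bf16_truncate(1e-8)
tiny_fp16 = float(np.float16(1e-8))

# Resolution: a small step above 1.0 survives in fp16 but is lost in bf16.
step = 1.0 + 2.0 ** -8
step_bf16 = to_bf16_truncate(step)
step_fp16 = float(np.float16(step))

print(tiny_bf16, tiny_fp16)  # nonzero vs 0.0
print(step_bf16, step_fp16)  # 1.0 vs 1.00390625
```

The second pair hints at why precision-sensitive operations such as normalization statistics are often kept in fp32 even when the rest of the model runs in bf16.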

Phase 4 · Diagnosing a NaN'ing Run

Diagnose a NaN'ing run and prescribe the fix

1 drop
  1. Prescribe a fix for a NaN'ing fp16 training run (8 min)
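
One common prescription for a NaN'ing fp16 run is dynamic loss scaling: skip any step whose gradients contain inf or NaN, shrink the scale, and grow it again after a stretch of clean steps. The sketch below is a plain-Python illustration of that policy. The class name `DynamicLossScaler`, its parameters, and the default values are assumptions made for this example, not any specific library's API.

```python
import math

class DynamicLossScaler:
    """Hypothetical dynamic loss scaler: back off on overflow, grow when stable."""

    def __init__(self, init_scale=2.0 ** 16, growth_interval=2000):
        self.scale = init_scale                 # current loss scale
        self.growth_interval = growth_interval  # clean steps required before growing
        self._good_steps = 0

    def update(self, grads):
        """Inspect unscaled gradients; return True if the optimizer step should run."""
        if not all(math.isfinite(g) for g in grads):
            # Overflow detected: skip this step and halve the scale.
            self.scale /= 2.0
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            # A stable stretch: try a larger scale to protect small gradients.
            self.scale *= 2.0
            self._good_steps = 0
        return True

scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=2)
print(scaler.update([0.5, float("inf")]), scaler.scale)  # False 512.0
print(scaler.update([0.1, 0.2]), scaler.scale)           # True 512.0
print(scaler.update([0.3, 0.4]), scaler.scale)           # True 1024.0
```

If the scale keeps collapsing toward zero under a policy like this, the overflow is probably in the forward pass or the weights rather than in the gradients, which points toward bf16 or keeping the offending operation in fp32.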

Frequently asked questions

What's the actual difference between bf16 and fp16?
Why does fp16 training produce NaNs when bf16 doesn't?
Do I still need loss scaling with bf16?
Is bf16 always better than fp16 for training?
What's mixed precision and why use fp32 master weights?

All of these are covered in the “Understand bf16, fp16, and Loss Scaling” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.