Question 1

What's the actual difference between bf16 and fp16?

Accepted Answer

This is covered in the "Understand bf16, fp16, and Loss Scaling" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
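As a rough illustration of the difference this question asks about, the sketch below splits a number into its sign, exponent, and mantissa fields in each format. fp16 uses 1 sign, 5 exponent, and 10 mantissa bits; bf16 uses 1 sign, 8 exponent, and 7 mantissa bits. The helper names `fp16_fields` and `bf16_fields` are hypothetical, and bf16 is approximated here by truncating the fp32 encoding, whereas real hardware typically rounds.

```python
import struct

def fp16_fields(x: float):
    """Split x, rounded to IEEE fp16, into (sign, exponent, mantissa) bit fields."""
    (bits,) = struct.unpack("<H", struct.pack("<e", x))
    return bits >> 15, (bits >> 10) & 0x1F, bits & 0x3FF

def bf16_fields(x: float):
    """Split x into bf16 fields by keeping the top 16 bits of its fp32 encoding.

    Truncation is a simplification made for this sketch.
    """
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    bits = bits32 >> 16
    return bits >> 15, (bits >> 7) & 0xFF, bits & 0x7F

# Largest finite value: (2 - 2^-mantissa_bits) * 2^max_exponent
FP16_MAX = (2 - 2.0**-10) * 2.0**15    # 65504
BF16_MAX = (2 - 2.0**-7) * 2.0**127    # roughly 3.39e38, close to fp32's maximum

# Gap between 1.0 and the next representable value
FP16_EPS = 2.0**-10
BF16_EPS = 2.0**-7

print(fp16_fields(-2.5), bf16_fields(-2.5))
print(FP16_MAX, BF16_MAX, FP16_EPS, BF16_EPS)
```

The wider exponent field gives bf16 far more range, while the wider mantissa field gives fp16 finer steps between neighboring values.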

Question 2

Why does fp16 training NaN but bf16 doesn't?

Accepted Answer

This is covered in the "Understand bf16, fp16, and Loss Scaling" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
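One way to see where the NaNs come from is to round a very small and a very large value into each format. The sketch below is a simulation in pure Python, not a training run: `to_fp16` and `to_bf16` are hypothetical helpers, the sample values are made up, and bf16 is approximated by truncating the fp32 encoding.

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest fp16 value; overflow becomes +/-inf, as on hardware."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        # Python refuses to pack values beyond the fp16 range, so map them to inf.
        return math.copysign(math.inf, x)

def to_bf16(x: float) -> float:
    """Approximate bf16 by zeroing the low 16 bits of the fp32 encoding."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

tiny_grad = 1e-8          # illustrative gradient magnitude
big_activation = 70000.0  # illustrative value above fp16's 65,504 ceiling

underflow_fp16 = to_fp16(tiny_grad)       # flushed to zero
underflow_bf16 = to_bf16(tiny_grad)       # still nonzero
overflow_fp16 = to_fp16(big_activation)   # becomes inf
overflow_bf16 = to_bf16(big_activation)   # stays finite, but coarser

# Once an inf exists, ordinary arithmetic on it can yield NaN.
nan_from_inf = overflow_fp16 - overflow_fp16

print(underflow_fp16, underflow_bf16)
print(overflow_fp16, overflow_bf16, nan_from_inf)
```

In this simulation fp16 loses the small value to zero and turns the large one into inf, and subtracting inf from inf produces NaN. bf16 keeps both values finite because it shares fp32's exponent width.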

Question 3

Do I still need loss scaling with bf16?

Accepted Answer

This is covered in the "Understand bf16, fp16, and Loss Scaling" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
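The path's lesson titles describe loss scaling as a range shift and state that bf16 needs no loss scaler. A minimal sketch of that idea, under the assumption that a single gradient value stands in for a whole tensor: multiply before storing in fp16, divide after reading back in higher precision, and skip the step if the scaled value overflows. The function `scaled_fp16_grad` and the scale of 2^16 are illustrative choices, not values taken from the course.

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest fp16 value; overflow becomes +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)

def to_bf16(x: float) -> float:
    """Approximate bf16 by zeroing the low 16 bits of the fp32 encoding."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def scaled_fp16_grad(grad: float, loss_scale: float):
    """Store grad * loss_scale in fp16, then unscale in higher precision.

    Returns None when the scaled value overflows; a real scaler would skip
    the optimizer step and lower the scale in that case.
    """
    stored = to_fp16(grad * loss_scale)
    if math.isinf(stored) or math.isnan(stored):
        return None
    return stored / loss_scale

true_grad = 1e-8
without_scaling = to_fp16(true_grad)                     # underflows to 0.0
with_scaling = scaled_fp16_grad(true_grad, 2.0**16)      # recovered, about 1e-8
too_much_scaling = scaled_fp16_grad(10.0, 2.0**16)       # overflow, step skipped
bf16_unscaled = to_bf16(true_grad)                       # survives with no scaler

print(without_scaling, with_scaling, too_much_scaling, bf16_unscaled)
```

The scale factor only moves values into fp16's representable window; it adds no precision, which is why too large a scale overflows instead of helping.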

Question 4

Is bf16 always better than fp16 for training?

Accepted Answer

This is covered in the "Understand bf16, fp16, and Loss Scaling" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
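A small sketch of the trade-off behind this question: bf16 gains range by giving up mantissa bits, so values close together can collapse to the same number. The helpers below are hypothetical, the test value 1.001 is arbitrary, and bf16 is again approximated by truncating the fp32 encoding.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest fp16 value (x is assumed to be within fp16 range)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x: float) -> float:
    """Approximate bf16 by zeroing the low 16 bits of the fp32 encoding."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

value = 1.001

as_fp16 = to_fp16(value)   # fp16 steps near 1.0 are 2^-10 wide
as_bf16 = to_bf16(value)   # bf16 steps near 1.0 are 2^-7 wide

print(as_fp16, as_bf16)
```

Here fp16 keeps a value distinct from 1.0 while bf16 returns exactly 1.0, which is one reason precision-sensitive operations are often kept in a wider format.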

Question 5

What's mixed precision and why use fp32 master weights?

Accepted Answer

This is covered in the "Understand bf16, fp16, and Loss Scaling" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
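To make the master-weight idea concrete, the sketch below applies the same small update 1,000 times to a weight stored in fp16 and to a weight stored in fp32. It is a toy simulation under stated assumptions: the starting weight, update size, and step count are invented, and `to_fp16` and `to_fp32` are hypothetical rounding helpers built on `struct`.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest fp16 value (x is assumed to be within fp16 range)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x: float) -> float:
    """Round x to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

update = 1e-4   # illustrative learning_rate * gradient
steps = 1000

fp16_weight = 1.0
master_weight = 1.0
for _ in range(steps):
    # The update is smaller than half of fp16's step near 1.0, so it rounds away.
    fp16_weight = to_fp16(fp16_weight + update)
    # The fp32 master copy is fine-grained enough to accumulate it.
    master_weight = to_fp32(master_weight + update)

# In mixed precision the low-precision copy is re-derived from the master copy.
working_copy = to_fp16(master_weight)

print(fp16_weight, master_weight, working_copy)
```

In this run the fp16-only weight never moves from 1.0, while the fp32 master reaches roughly 1.1 and the fp16 working copy derived from it reflects the accumulated change.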

🔬 Understand bf16, fp16, and Loss Scaling

Phase 1: Inside a Float: Sign, Exponent, Mantissa

A float is three knobs, not one number

Exponent buys reach. Mantissa buys resolution.

fp16 has a basement at 6e-5

fp16 has a ceiling at 65,504

Phase 2: Watching Gradients Survive or Disappear

A gradient that lives in fp32 dies in fp16

Loss scaling is a range-shift, not a magic constant

Mixed precision keeps an fp32 master copy of every weight

bf16 needs no loss scaler at all

Pick which gradient lives in which precision

Phase 3: Why bf16 Took Over After the A100

Your team trained fp16 last year because their GPU couldn't do bf16

Your inference service runs fp16 — should it switch?

A LayerNorm in bf16 makes your loss curve weird

Google picked bf16 in 2017. Why did the industry wait?

Phase 4: Diagnosing a NaN'ing Run

Prescribe a fix for a NaN'ing fp16 training run

Frequently asked questions

Related paths

🐍 Python Decorators Introduction

🦀 Rust Lifetimes Explained

☸️ Kubernetes Core Concepts

📈 Big O Intuition