Understand Tensor Cores and Mixed Precision
Stop hand-waving about '100x faster than CUDA cores.' You'll trace one 4x4 tile through a tensor core's registers, multipliers, and FP32 accumulator, then estimate the real FLOPS uplift from switching one layer of your favorite model to mixed precision.
Phase 1: Why Chips Built a Core Just for Matmul
See why a neural net is a stack of small matmuls
A neural net is a stack of small matmuls (6 min)
CUDA cores do one FMA. Tensor cores do sixty-four. (6 min)
FP16 inputs, FP32 accumulator: the whole magic (7 min)
Mixed precision isn't 'half the bytes'; it's 'eight times the throughput' (7 min)
Phase 2: Walking a 4x4 Tile Through the Silicon
Trace a 4x4 FMA tile through the silicon by hand
The operand is a 4x4 tile, not a vector (7 min)
Step 1: A and B land in registers (7 min)
Step 2: Sixty-four FP16 multiplies in one cycle (6 min)
Step 3: The FP32 accumulator catches every product (7 min)
Step 4: D goes back to registers, or back through another tile (7 min)
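If you want to check the tile walk by hand, a minimal sketch along these lines may help. It emulates D = A x B + C on one 4x4 tile in plain Python, rounding the inputs to FP16 and the running sum to FP32 with the standard `struct` module. The helper names (`to_fp16`, `to_fp32`, `tile_fma`) are made up for this illustration, and the loop order is an assumption: real hardware does the work in parallel and may round differently.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE binary16 (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE binary32 (FP32) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def tile_fma(a, b, c, n=4):
    """Emulate D = A @ B + C on an n x n tile.

    Inputs A and B are rounded to FP16; the accumulator is kept in FP32.
    Returns the result tile and the number of multiplies performed.
    """
    d = [[0.0] * n for _ in range(n)]
    multiplies = 0
    for i in range(n):
        for j in range(n):
            acc = to_fp32(c[i][j])          # accumulator starts from C
            for k in range(n):
                # An FP16 x FP16 product fits exactly in FP32.
                prod = to_fp16(a[i][k]) * to_fp16(b[k][j])
                acc = to_fp32(acc + prod)   # accumulate in FP32
                multiplies += 1
            d[i][j] = acc
    return d, multiplies

# Hypothetical inputs: A is the identity, B holds 0..15, C is 2048 everywhere.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
b = [[float(4 * i + j) for j in range(4)] for i in range(4)]
c = [[2048.0] * 4 for _ in range(4)]

d, multiplies = tile_fma(identity, b, c)
print(multiplies)          # one multiply per (i, j, k) triple
print(d[0][1])             # FP32 accumulator keeps 2048 + 1
print(to_fp16(2049.0))     # an FP16 accumulator would round the same sum
```

The last two prints show why the accumulator is wider than the inputs: FP16 cannot represent 2049, so a sum kept in FP16 would silently drop the small addend, while the FP32 sum keeps it.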
Phase 3: Volta to Blackwell, Generation by Generation
Walk Volta to Blackwell through real workloads
The training NaN'd. Ampere shipped BF16. (7 min)
Your transformer wants FP8. H100 says yes. (7 min)
Half your weights are zero. The tile knows. (7 min)
FP4 sounds insane. Blackwell did it anyway. (7 min)
Phase 4: Estimating Your Layer's FLOPS Uplift
Estimate one layer's mixed-precision FLOPS uplift
Estimate the FLOPS uplift on one real layer (18 min)
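As a rough starting point for that estimate, here is a back-of-the-envelope sketch using a simple roofline model: a layer takes whichever is longer, its compute time or its memory-traffic time. The function name `estimate_uplift` is hypothetical, and the default peak figures are assumptions taken from the published V100 datasheet numbers (15.7 TFLOPS FP32, 125 TFLOPS FP16 tensor, 900 GB/s memory bandwidth). It also assumes inputs and outputs are stored in FP16 on the mixed-precision path and that each matrix is read or written once. Measured speedups are usually lower than these peaks suggest.

```python
def estimate_uplift(m, n, k,
                    peak_fp32=15.7e12,   # FLOP/s, assumed V100 FP32 peak
                    peak_tc=125e12,      # FLOP/s, assumed V100 tensor-core peak
                    bandwidth=900e9):    # bytes/s, assumed V100 memory bandwidth
    """Estimate the speedup of an (m x k) @ (k x n) layer under mixed precision.

    Roofline model: time = max(compute time, memory time).
    """
    flops = 2 * m * n * k                 # one multiply + one add per (i, j, k)
    elements = m * k + k * n + m * n      # A, B, and the output, touched once each

    def layer_time(peak, bytes_per_element):
        compute_time = flops / peak
        memory_time = elements * bytes_per_element / bandwidth
        return max(compute_time, memory_time)

    t_fp32 = layer_time(peak_fp32, 4)     # FP32 storage: 4 bytes per element
    t_mixed = layer_time(peak_tc, 2)      # FP16 storage: 2 bytes per element
    return t_fp32 / t_mixed

# A large square layer is compute-bound, so the uplift tracks the peak ratio.
print(estimate_uplift(4096, 4096, 4096))

# A batch-of-one layer is memory-bound, so the uplift tracks the byte ratio.
print(estimate_uplift(1, 4096, 4096))
```

The two calls line up with the Phase 1 contrast: when the layer is memory-bound you only get the 'half the bytes' win, and the larger throughput gain shows up only once the layer is big enough to be compute-bound.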
Frequently asked questions
- What is a tensor core and how is it different from a CUDA core?
- Why do tensor cores use FP16 inputs but accumulate in FP32?
- What does a 4x4 fused multiply-add actually do inside the chip?
- How much faster is mixed precision than FP32 on a real model layer?
- What changed between Volta, Ampere, Hopper, and Blackwell tensor cores?
Each of these questions is covered in the "Understand Tensor Cores and Mixed Precision" learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.