🧠 Understand Mixture-of-Experts (MoE) Architectures
Stop hearing 'experts vote' and start watching a single token route through a sparse layer — by the end you'll predict which inputs land on which expert in a small MoE you design yourself.
Phase 1: Why Sparsity — and What MoE Actually Activates
See why sparsity beats dense scaling for parameter budgets
Dense models pay for parameters they barely use (6 min)
An expert is just a feed-forward block — nothing more (6 min)
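For reference, this is roughly all "expert" means in code: a plain two-layer feed-forward block. A minimal sketch with made-up dimensions, not the course's exact implementation:

```python
import torch
import torch.nn as nn

class Expert(nn.Module):
    """One expert: a plain two-layer feed-forward block, nothing exotic."""
    def __init__(self, d_model: int = 64, d_hidden: int = 256):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)    # expand
        self.down = nn.Linear(d_hidden, d_model)  # project back

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(torch.relu(self.up(x)))

# A 4-expert MoE layer stores four of these side by side.
experts = nn.ModuleList(Expert() for _ in range(4))
```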
The router is a one-line linear layer that picks experts (7 min)
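A minimal sketch of that one line, assuming a 64-dimensional hidden state and 4 experts; the dimensions are illustrative.

```python
import torch
import torch.nn as nn

d_model, num_experts = 64, 4

# The router really is this small: one linear map from the token's
# hidden state to one score per expert.
router = nn.Linear(d_model, num_experts, bias=False)

token = torch.randn(d_model)   # one token's hidden state
scores = router(token)         # four numbers, one per expert
print(scores.shape)            # torch.Size([4])
```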
Most experts sit idle on every token — that's the feature (6 min)
Phase 2: Walk One Token Through a 4-Expert MoE
Walk one token through a 4-expert layer by hand
Pick a token, name your four experts, write down the gates (6 min)
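If you would rather follow along in code than on paper, a setup this small is enough; the expert names and dimensions below are placeholders, not values from the lesson.

```python
import torch

token = torch.randn(64)   # the one token you will follow, d_model = 64

# Hypothetical names, purely for bookkeeping while you reason about routing.
expert_names = ["punctuation", "numbers", "code", "prose"]

# The gates are what the next steps will fill in.
gates = {name: None for name in expert_names}
```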
Compute gating logits — one matmul, four numbers (7 min)
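Here is what that single matmul can look like, assuming the 64-dimensional token and 4-expert setup above; the weights are random stand-ins.

```python
import torch

d_model, num_experts = 64, 4
token = torch.randn(d_model)                  # the token you picked
W_router = torch.randn(num_experts, d_model)  # router weights (random stand-in)

logits = W_router @ token    # one matmul, four gating logits
print(logits.shape)          # torch.Size([4])
```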
Top-2 routing throws away most of what the router said (7 min)
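A sketch of top-2 selection over four made-up logits; note that the softmax is renormalized over only the two survivors, so the router's scores for the other two experts are dropped entirely.

```python
import torch

logits = torch.tensor([1.2, -0.3, 0.8, -1.5])   # four gating logits (made up)

top_vals, top_idx = torch.topk(logits, k=2)     # keep only the two best experts
gates = torch.softmax(top_vals, dim=-1)         # renormalize over the survivors

print(top_idx)   # tensor([0, 2]): experts 0 and 2 were chosen
print(gates)     # two weights summing to 1; experts 1 and 3 get exactly 0
```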
Only the chosen experts compute — everyone else is silent (7 min)
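To make "everyone else is silent" concrete, a toy loop that only ever calls the chosen experts; the expert blocks and chosen indices here are illustrative.

```python
import torch
import torch.nn as nn

d_model = 64
experts = nn.ModuleList(
    nn.Sequential(nn.Linear(d_model, 256), nn.ReLU(), nn.Linear(256, d_model))
    for _ in range(4)
)

token = torch.randn(d_model)
chosen = [0, 2]   # indices picked by top-2 routing

outputs = {i: experts[i](token) for i in chosen}   # only experts 0 and 2 run
# experts 1 and 3 do no work at all for this token
```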
Combine expert outputs — weighted sum, then move on (7 min)
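And the final step as code, assuming the gates and expert outputs from the previous sketches are already in hand.

```python
import torch

d_model = 64
gates = torch.tensor([0.6, 0.4])   # renormalized top-2 weights (made up)
out_a = torch.randn(d_model)       # output of the first chosen expert
out_b = torch.randn(d_model)       # output of the second chosen expert

# Weighted sum, then the token moves on to the next layer.
combined = gates[0] * out_a + gates[1] * out_b
```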
Phase 3: Load Balancing, Expert Collapse, and Why Inference Wins
Understand load balancing, expert collapse, and why MoE wins at inference
Without an auxiliary loss, all your tokens go to one expert (7 min)
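One way to spot collapse in practice is to count the fraction of tokens each expert receives; the batch below is random, but the measurement is the point.

```python
import torch

num_tokens, num_experts = 1000, 4
router_logits = torch.randn(num_tokens, num_experts)   # logits for a batch of tokens

top1 = router_logits.argmax(dim=-1)                    # each token's favourite expert
load = torch.bincount(top1, minlength=num_experts).float() / num_tokens

print(load)   # healthy routing sits near 0.25 each; collapse looks like one value near 1.0
```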
Auxiliary losses don't pick experts — they push for fairness (7 min)
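A sketch in the spirit of the Switch Transformer-style balancing loss: it never picks an expert, it only adds a penalty that shrinks as routed load and mean gate probability spread evenly. Exact formulations and coefficients vary across papers.

```python
import torch

def load_balance_loss(router_logits: torch.Tensor, num_experts: int) -> torch.Tensor:
    """Penalty that is smallest when routing load is spread evenly across experts."""
    probs = torch.softmax(router_logits, dim=-1)        # [num_tokens, num_experts]

    # f_i: fraction of tokens whose top-1 choice is expert i
    top1 = probs.argmax(dim=-1)
    f = torch.bincount(top1, minlength=num_experts).float() / router_logits.shape[0]

    # P_i: mean router probability assigned to expert i
    P = probs.mean(dim=0)

    # Minimized when both f and P are uniform at 1 / num_experts.
    return num_experts * torch.sum(f * P)

loss = load_balance_loss(torch.randn(256, 4), num_experts=4)
print(loss)   # added to the main loss with a small coefficient
```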
MoE wins at inference and bleeds at training (7 min)
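The inference win is mostly arithmetic: parameters scale with the number of experts, but each token only pays for its top-k. Rough numbers under made-up dimensions:

```python
d_model, d_hidden = 1024, 4096
per_expert = 2 * d_model * d_hidden        # up-projection + down-projection weights
num_experts, top_k = 8, 2

total_params  = num_experts * per_expert   # what you have to store
active_params = top_k * per_expert         # what one token actually computes with

print(f"total:  {total_params:,}")         # 67,108,864
print(f"active: {active_params:,}")        # 16,777,216 -> 4x less FFN compute per token
```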
Experts specialize by surface features more than by topic (7 min)
Phase 4: Design a Small MoE — and Predict Its Routing
Design a small MoE and predict its routing patterns
Sketch a 4-expert MoE for your own task and predict where tokens land (8 min)
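If you want something runnable next to your sketch, here is a compact top-2 MoE layer assembled from the pieces above; it is a teaching sketch under assumed dimensions, not an efficient or production implementation.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, d_hidden=256, num_experts=4, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x):                        # x: [num_tokens, d_model]
        logits = self.router(x)                  # [num_tokens, num_experts]
        top_vals, top_idx = torch.topk(logits, self.top_k, dim=-1)
        gates = torch.softmax(top_vals, dim=-1)  # renormalize over chosen experts

        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, k] == e        # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += gates[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

moe = TinyMoE()
tokens = torch.randn(10, 64)
print(moe(tokens).shape)   # torch.Size([10, 64])
```

The double loop is deliberately naive so the dispatch logic stays visible; real implementations batch tokens per expert and cap capacity instead.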
Frequently asked questions
- What is sparse activation in mixture-of-experts models?
- Only a few of the model's experts run for any given token: the router scores all of them, but just the top-k actually compute, so most parameters stay idle on every forward pass. Phase 1 of the "Understand Mixture-of-Experts (MoE) Architectures" path builds this picture one lesson at a time.
- How does the router decide which experts to send a token to?
- The router is a single linear layer: one matmul maps the token's hidden state to one logit per expert, and the top-k logits decide which experts see the token. Phase 2 walks a single token through the whole process by hand.
- Why does top-2 routing beat top-1 in practice?
- Routing each token to two experts keeps a gradient flowing to more than one expert and lets the layer blend two opinions instead of betting everything on the router's first guess, at roughly double the expert compute. The Phase 2 lesson on top-2 routing unpacks the trade-off.
- What is expert collapse and how do load balancing losses fix it?
- Collapse is when the router sends nearly every token to the same one or two experts, so the others never learn anything useful. Load-balancing losses add a penalty that grows as routing concentrates, nudging the router toward an even spread without ever picking an expert directly. Phase 3 covers both.
- Why is MoE cheap at inference but tricky to train?
- At inference a token only runs through its top-k experts, so compute per token stays close to a much smaller dense model while total capacity grows with the expert count. Training has to keep every expert alive and balanced, which means auxiliary losses, extra memory, and more engineering. Phase 3 looks at both sides.
Related paths
🐍 Python Decorators Introduction
Build one mental model for Python decorators that covers closures, argument passing, functools.wraps, and stacking — then ship a working caching or logging decorator from scratch in under 30 lines.
🦀 Rust Lifetimes Explained
Stop reading `'a` as line noise and start reading it as scope arithmetic — one failing snippet at a time — until you can thread lifetimes through a small parser or iterator adapter without fighting the borrow checker.
☸️ Kubernetes Core Concepts
Stop drowning in 30+ resource types. Build the mental model one primitive at a time -- pods, deployments, services, ingress, config -- then deploy a real app with rolling updates and health checks.
📈 Big O Intuition
Stop treating Big O as math you memorized for an interview — build the intuition to spot O(n²) disasters, pick the right data structure without thinking, and rewrite a slow function from O(n²) to O(n) in under five minutes.