🧭 Understand MoE Routing and Load Balancing

Open the MoE router black box piece by piece — softmax gate, top-k, auxiliary loss, capacity factor, token dropping — until you can predict how capacity factor 1.0 versus 1.25 changes wasted compute and dropped tokens, then verify with an ablation.

Advanced · 14 drops · ~2-week path · 5–8 min/day · technology

Phase 1: From dense FFN to conditional routing

Why dense FFNs waste compute and what conditional routing replaces them with

4 drops

A dense FFN runs every neuron on every token, even when it shouldn't · 6 min
The router is a single linear layer plus softmax · 6 min
Without a load-balancing loss, the router picks two experts forever · 7 min
Capacity factor decides how many tokens each expert can refuse · 7 min

Phase 2: Trace a token through the router

Trace one token through softmax gate, top-k, and auxiliary loss

5 drops

From hidden state to gate probabilities in three lines of code · 6 min
Switch routing sends each token to exactly one expert · 6 min
The auxiliary loss is one tensor product, computed per batch · 7 min
All-to-all is where routing becomes a distributed-systems problem · 8 min
Dropped tokens skip the MoE layer entirely; only the residual survives · 7 min
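
As a rough illustration of the steps traced in this phase, the sketch below computes gate probabilities with one linear layer plus softmax, picks a single expert per token (Switch-style top-1), and evaluates a load-balancing term of the form N · Σ f_i · P_i, where f_i is the fraction of tokens sent to expert i and P_i is the mean gate probability for expert i. It is a minimal pure-Python sketch, not the course's own code: the names gate_probs, top1_route, switch_aux_loss, and w_router are hypothetical, and the scaling coefficient that would weight this term against the main training loss is left out.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gate_probs(hidden, w_router):
    # hidden: d floats for one token.
    # w_router (hypothetical name): num_experts rows of d floats, i.e. one linear layer without bias.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w_router]
    return softmax(logits)

def top1_route(batch, w_router):
    # Returns per-token gate probabilities and the index of the single chosen expert.
    probs = [gate_probs(h, w_router) for h in batch]
    choices = [max(range(len(p)), key=p.__getitem__) for p in probs]
    return probs, choices

def switch_aux_loss(probs, choices, num_experts):
    # f[e]: fraction of tokens dispatched to expert e (computed over the batch).
    # p[e]: mean gate probability assigned to expert e (computed over the batch).
    n = len(probs)
    f = [choices.count(e) / n for e in range(num_experts)]
    p = [sum(row[e] for row in probs) / n for e in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Perfectly balanced routing over 4 experts gives 1.0.
print(switch_aux_loss([[0.25] * 4] * 4, [0, 1, 2, 3], 4))
# Every token routed confidently to expert 0 gives 4.0.
print(switch_aux_loss([[1.0, 0.0, 0.0, 0.0]] * 4, [0] * 4, 4))
```

The term is smallest when both the dispatch fractions and the mean probabilities are uniform, which is why adding it to the training loss pushes the router away from collapsing onto a few experts.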

Phase 3: How real MoEs trade off routing choices

How Switch, Mixtral, and DeepSeek-MoE pick different points on the same axis

4 drops

A teammate proposes 'just use top-1 like Switch' to halve training cost · 7 min
Mixtral has 8 experts; DeepSeek-MoE has 64. Why pick a number? · 8 min
Training loss is fine, but the utilization plot shows two hot experts · 8 min
Your inference batch is small. Does the router behave the same? · 8 min
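
The top-1 versus top-2 comparison in this phase can be made concrete with a small selection function. The sketch below is an assumption-laden illustration rather than any model's actual code: top_k_route and its renormalize flag are hypothetical names. With renormalize=True the selected gate weights are rescaled to sum to 1, which matches applying a softmax over only the selected experts' logits; with renormalize=False the raw gate probability is kept, as in Switch-style top-1 routing where the expert output is scaled by the gate value.

```python
def top_k_route(probs, k, renormalize=True):
    # probs: gate probabilities for one token, one entry per expert.
    # Returns (expert_index, combine_weight) pairs for the k highest-probability experts.
    ranked = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
    if renormalize:
        total = sum(probs[e] for e in ranked)
        return [(e, probs[e] / total) for e in ranked]
    return [(e, probs[e]) for e in ranked]

probs = [0.5, 0.3, 0.15, 0.05]

# Top-1: one expert runs for this token; its output is scaled by the raw gate probability.
print(top_k_route(probs, k=1, renormalize=False))

# Top-2: two experts run for this token; their outputs are mixed with renormalized weights.
print(top_k_route(probs, k=2))
```

Each selected expert runs its own FFN on the token, so expert compute per token grows roughly in proportion to k. That is the cost the teammate's top-1 proposal is trying to cut, at the price of mixing in only one expert's output.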

Phase 4: Predict and verify capacity-factor effects

Predict capacity-factor effects on an imbalanced batch, then ablate

1 drop

Predict cf=1.0 vs cf=1.25 on an imbalanced batch, then ablate to verify · 12 min
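
To give a feel for the prediction this phase asks for, the sketch below counts dropped tokens and unused expert slots for one imbalanced batch at cf=1.0 and cf=1.25. It assumes the Switch-style definition, capacity = (tokens in batch / number of experts) × capacity factor, and rounds up, which is an assumption since implementations differ on rounding. The function names and the 16-token assignment list are made up for illustration.

```python
import math
from collections import Counter

def expert_capacity(num_tokens, num_experts, capacity_factor):
    # Slots per expert; rounding up is an assumption of this sketch.
    return math.ceil(num_tokens / num_experts * capacity_factor)

def capacity_report(assignments, num_experts, capacity_factor):
    # assignments: the expert index chosen by top-1 routing for each token.
    cap = expert_capacity(len(assignments), num_experts, capacity_factor)
    load = Counter(assignments)
    # Tokens beyond an expert's capacity are dropped (they only pass through the residual).
    dropped = sum(max(0, load[e] - cap) for e in range(num_experts))
    # Slots reserved for an expert but left unfilled are wasted (padded) compute.
    empty_slots = sum(max(0, cap - load[e]) for e in range(num_experts))
    return {"capacity": cap, "dropped": dropped, "empty_slots": empty_slots}

# Hypothetical imbalanced batch: 16 tokens over 4 experts with loads 7, 5, 3, 1.
assignments = [0] * 7 + [1] * 5 + [2] * 3 + [3] * 1

for cf in (1.0, 1.25):
    print(cf, capacity_report(assignments, 4, cf))
# cf=1.0  -> capacity 4, 4 tokens dropped, 4 empty slots
# cf=1.25 -> capacity 5, 2 tokens dropped, 6 empty slots
```

On this batch, raising the capacity factor halves the dropped tokens but increases the reserved slots from 16 to 20 and the empty ones from 4 to 6, which is the trade-off an ablation would then measure on real routing decisions.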

Frequently asked questions

What is the router in a mixture-of-experts model actually doing?
This is covered in the “Understand MoE Routing and Load Balancing” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.

Why does MoE training need an auxiliary load-balancing loss?
This is covered in the “Understand MoE Routing and Load Balancing” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.

What does 'capacity factor 1.25' mean and why does raising it matter?
This is covered in the “Understand MoE Routing and Load Balancing” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.

How is top-1 (Switch) routing different from top-2 (Mixtral) routing?
This is covered in the “Understand MoE Routing and Load Balancing” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.

Why do MoE models drop tokens, and when is that acceptable?
This is covered in the “Understand MoE Routing and Load Balancing” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.

Related paths

🐍 Python Decorators Introduction

🦀 Rust Lifetimes Explained

☸️ Kubernetes Core Concepts

📈 Big O Intuition