🧭 Understand MoE Routing and Load Balancing

Open the MoE router black box piece by piece — softmax gate, top-k, auxiliary loss, capacity factor, token dropping — until you can predict how capacity factor 1.0 versus 1.25 changes wasted compute and dropped tokens, then verify with an ablation.

Advanced · 14 drops · ~2-week path · 5–8 min/day · technology

Phase 1: From dense FFN to conditional routing

Why dense FFNs waste compute and what conditional routing replaces them with

4 drops
  1. A dense FFN runs every neuron on every token, even when it shouldn't

    6 min

  2. The router is a single linear layer plus softmax

    6 min

  3. Without a load-balancing loss, the router picks two experts forever

    7 min

  4. Capacity factor decides how many tokens each expert can refuse

    7 min

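As a rough illustration of the first two drops in this phase, the sketch below compares per-token FFN compute and parameter count for a dense layer and a routed layer. The helper names, the plain two-matmul FFN (no gated variant), and the example sizes are assumptions chosen for illustration, not figures taken from this path.

```python
def ffn_flops_per_token(d_model, d_ff):
    """Dense FFN: up-projection plus down-projection, 2 FLOPs per multiply-add."""
    return 2 * (2 * d_model * d_ff)

def ffn_params(d_model, d_ff):
    """Weights of the two projection matrices (biases ignored)."""
    return 2 * d_model * d_ff

def moe_flops_per_token(d_model, d_ff, num_experts, top_k):
    """Routed layer: one linear router plus top_k expert FFNs per token."""
    router = 2 * d_model * num_experts
    return router + top_k * ffn_flops_per_token(d_model, d_ff)

def moe_params(d_model, d_ff, num_experts):
    """All experts are stored, even though only top_k run for a given token."""
    return d_model * num_experts + num_experts * ffn_params(d_model, d_ff)

# Hypothetical sizes, chosen only to make the arithmetic easy to follow.
d_model, d_ff, num_experts = 1024, 4096, 8

dense_flops = ffn_flops_per_token(d_model, d_ff)
moe_flops = moe_flops_per_token(d_model, d_ff, num_experts, top_k=1)

print(dense_flops)                          # 16777216
print(moe_flops)                            # 16793600
print(ffn_params(d_model, d_ff))            # 8388608
print(moe_params(d_model, d_ff, num_experts))  # 67117056
```

Under these assumptions, top-1 routing stores roughly eight times the FFN parameters while the per-token compute stays within about 0.1% of the dense layer. This count ignores capacity-factor padding and communication, both of which later drops cover.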

Phase 2: Trace a token through the router

Trace one token through softmax gate, top-k, and auxiliary loss

5 drops
  1. From hidden state to gate probabilities in three lines of code

    6 min

  2. Switch routing sends each token to exactly one expert

    6 min

  3. The auxiliary loss is one tensor product, computed per batch

    7 min

  4. All-to-all is where routing becomes a distributed-systems problem

    8 min

  5. Dropped tokens skip the MoE layer entirely — only the residual survives

    7 min

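The steps this phase traces can be sketched in plain Python on a toy batch. This is a minimal sketch, not the path's own code: the function names, the 2-expert router weights, and the token values are hypothetical, and the auxiliary loss follows the Switch-style form (number of experts times the sum over experts of dispatch fraction times mean gate probability), with its scaling coefficient omitted.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gate_probs(hidden, router_weights):
    """Router = one linear layer (one weight row per expert) followed by softmax."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in router_weights]
    return softmax(logits)

def top1(probs):
    """Top-1 routing: index and gate probability of the highest-scoring expert."""
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs[expert]

def switch_aux_loss(all_probs, assignments, num_experts):
    """num_experts * sum_i f_i * P_i, computed over one batch of tokens."""
    n = len(all_probs)
    f = [assignments.count(e) / n for e in range(num_experts)]                   # dispatch fraction
    p = [sum(probs[e] for probs in all_probs) / n for e in range(num_experts)]   # mean gate prob
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

def layer_output(hidden, expert_out, gate_prob, dropped):
    """Residual around the MoE layer; a dropped token keeps only the residual."""
    if dropped:
        return list(hidden)
    return [h + gate_prob * y for h, y in zip(hidden, expert_out)]

# Toy batch: 2 experts, hidden size 2 (all values hypothetical).
router_weights = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[2.0, 0.0], [0.0, 2.0], [3.0, 0.0], [1.0, 0.0]]

all_probs = [gate_probs(t, router_weights) for t in tokens]
assignments = [top1(p)[0] for p in all_probs]
print(assignments)  # [0, 1, 0, 0]
print(switch_aux_loss(all_probs, assignments, num_experts=2))
```

With this form of the loss, a perfectly uniform split gives 1.0 and full collapse onto a single expert gives `num_experts`, so the toy batch above, which favors expert 0, lands in between. The all-to-all exchange is not modeled here, since everything runs in one process.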

Phase 3: How real MoEs trade off routing choices

How Switch, Mixtral, and DeepSeek-MoE pick different points on the same axis

4 drops
  1. A teammate proposes 'just use top-1 like Switch' to halve training cost

    7 min

  2. Mixtral has 8 experts; DeepSeek-MoE has 64 — why pick a number?

    8 min

  3. Training loss is fine, but the utilization plot shows two hot experts

    8 min

  4. Your inference batch is small — does the router behave the same?

    8 min

Phase 4: Predict and verify capacity-factor effects

Predict capacity-factor effects on an imbalanced batch, then ablate

1 drop
  1. Predict cf=1.0 vs cf=1.25 on an imbalanced batch, then ablate to verify

    12 min

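The kind of prediction this drop asks for can be worked through on paper first. The sketch below assumes the common definition of expert capacity (capacity factor times the even share of tokens per expert, rounded up), top-1 routing, and a hypothetical imbalanced batch; `capacity_effects` is an illustrative helper, not part of any library.

```python
import math

def capacity_effects(tokens_per_expert, capacity_factor):
    """Return (capacity, dropped tokens, empty slots) for one routed batch.

    Assumes capacity = ceil(capacity_factor * num_tokens / num_experts) and
    that tokens beyond an expert's capacity are dropped.
    """
    num_tokens = sum(tokens_per_expert)
    num_experts = len(tokens_per_expert)
    capacity = math.ceil(capacity_factor * num_tokens / num_experts)
    dropped = sum(max(0, n - capacity) for n in tokens_per_expert)
    empty_slots = sum(max(0, capacity - n) for n in tokens_per_expert)
    return capacity, dropped, empty_slots

# Hypothetical imbalanced batch: 32 tokens over 4 experts, two of them hot.
counts = [14, 10, 4, 4]

print(capacity_effects(counts, 1.0))   # (8, 8, 8)
print(capacity_effects(counts, 1.25))  # (10, 4, 12)
```

On this batch, raising the capacity factor from 1.0 to 1.25 halves the dropped tokens (8 to 4) while the padded, unused slots grow from 8 to 12. An ablation on a real model would check whether the measured drop rate and step time move the same way.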

Frequently asked questions

What is the router in a mixture-of-experts model actually doing?
Why does MoE training need an auxiliary load-balancing loss?
What does 'capacity factor 1.25' mean and why does raising it matter?
How is top-1 (Switch) routing different from top-2 (Mixtral) routing?
Why do MoE models drop tokens, and when is that acceptable?

Each of these questions is covered in the “Understand MoE Routing and Load Balancing” learning path, which builds from fundamentals to hands-on application in daily 5–8 minute micro-lessons.