
🧠 Understand Mixture-of-Experts (MoE) Architectures

Stop hearing 'experts vote' and start watching a single token route through a sparse layer — by the end you'll predict which inputs land on which expert in a small MoE you design yourself.

Advanced · 14 drops · ~2-week path · 5–8 min/day · Technology

Phase 1: Why Sparsity — and What MoE Actually Activates

See why sparsity beats dense scaling for parameter budgets (a short code sketch follows the drop list)

4 drops
  1. Dense models pay for parameters they barely use (6 min)
  2. An expert is just a feed-forward block — nothing more (6 min)
  3. The router is a one-line linear layer that picks experts (7 min)
  4. Most experts sit idle on every token — that's the feature (6 min)
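
Phase 1 boils down to two small pieces of code. The sketch below (plain numpy, illustrative sizes, not the course's own code) shows that an expert is just a feed-forward block and the router is a single linear layer; with top-1 routing, only one of the four expert blocks runs for a given token.

```python
# A minimal sketch: an "expert" is just a feed-forward block, and the router
# is a single linear layer. Names and shapes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden, n_experts = 8, 32, 4

def feed_forward(x, w_in, w_out):
    """One expert: a plain two-layer MLP with a ReLU in between."""
    return np.maximum(x @ w_in, 0.0) @ w_out

# Four experts = four independent feed-forward blocks.
experts = [(rng.normal(size=(d_model, d_hidden)) * 0.1,
            rng.normal(size=(d_hidden, d_model)) * 0.1) for _ in range(n_experts)]

# The router really is one linear layer: d_model inputs -> n_experts logits.
w_router = rng.normal(size=(d_model, n_experts)) * 0.1

x = rng.normal(size=(d_model,))   # one token's hidden state
logits = x @ w_router             # four numbers, one per expert
print("router logits:", logits)

# Sparse activation: with top-1 routing only 1 of the 4 expert FFNs runs,
# so roughly a quarter of the expert parameters are touched for this token.
best = int(np.argmax(logits))
y = feed_forward(x, *experts[best])
print("chosen expert:", best, "output shape:", y.shape)
```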

Phase 2: Walk One Token Through a 4-Expert MoE

Walk one token through a 4-expert layer by hand (a step-by-step code sketch follows the drops)

5 drops
  1. Pick a token, name your four experts, write down the gates (6 min)
  2. Compute gating logits — one matmul, four numbers (7 min)
  3. Top-2 routing throws away most of what the router said (7 min)
  4. Only the chosen experts compute — everyone else is silent (7 min)
  5. Combine expert outputs — weighted sum, then move on (7 min)
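
The same walk-through can be written out in a few lines. This numpy sketch follows Phase 2's steps for one token: gating logits from one matmul, softmax, keep the top-2, run only those experts, then take the weighted sum. Renormalizing the two gate values over the selected pair is one common convention, assumed here rather than taken from the course.

```python
# A minimal numpy sketch of the single-token walk-through.
# All names and sizes are illustrative, not the course's exact notation.
import numpy as np

rng = np.random.default_rng(1)
d_model, d_hidden, n_experts, top_k = 8, 32, 4, 2

experts = [(rng.normal(size=(d_model, d_hidden)) * 0.1,
            rng.normal(size=(d_hidden, d_model)) * 0.1) for _ in range(n_experts)]
w_router = rng.normal(size=(d_model, n_experts)) * 0.1
x = rng.normal(size=(d_model,))                 # one token's hidden state

# 1) Gating logits: one matmul, four numbers.
logits = x @ w_router

# 2) Softmax over all experts, then keep only the top-2 scores.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
top = np.argsort(probs)[-top_k:]                # indices of the 2 winners
gates = probs[top] / probs[top].sum()           # renormalize over the pair

# 3) Only the chosen experts compute; the other two stay silent.
outputs = [np.maximum(x @ experts[i][0], 0.0) @ experts[i][1] for i in top]

# 4) Combine: weighted sum of the two expert outputs, then move on.
y = sum(g * o for g, o in zip(gates, outputs))
print("winners:", top, "gates:", np.round(gates, 3), "output shape:", y.shape)
```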

Phase 3: Load Balancing, Expert Collapse, and Why Inference Wins

Learn how load balancing prevents expert collapse and why inference wins (an auxiliary-loss sketch follows the drops)

4 drops
  1. Without an auxiliary loss, all your tokens go to one expert (7 min)
  2. Auxiliary losses don't pick experts — they push for fairness (7 min)
  3. MoE wins at inference and bleeds at training (7 min)
  4. Experts specialize by surface features more than by topic (7 min)
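
For the auxiliary loss, here is one common formulation (a Switch Transformer-style balance loss), sketched in numpy as an assumption about what Phase 3 covers rather than a quote from it: it compares the fraction of tokens each expert actually receives with the router's average probability for that expert, and it is minimized when both are uniform.

```python
# One common load-balancing auxiliary loss. It never picks experts itself;
# it only nudges the router toward spreading tokens evenly.
import numpy as np

rng = np.random.default_rng(2)
n_tokens, d_model, n_experts = 16, 8, 4

w_router = rng.normal(size=(d_model, n_experts)) * 0.1
X = rng.normal(size=(n_tokens, d_model))        # a small batch of tokens

logits = X @ w_router
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)

assign = probs.argmax(axis=-1)                  # top-1 assignment per token
f = np.bincount(assign, minlength=n_experts) / n_tokens   # fraction of tokens per expert
P = probs.mean(axis=0)                          # mean router probability per expert

aux_loss = n_experts * np.sum(f * P)            # equals 1.0 when perfectly balanced
print("tokens per expert:", np.bincount(assign, minlength=n_experts))
print("aux loss:", round(float(aux_loss), 3))
# If every token collapses onto one expert, f and P concentrate there and the
# loss rises toward n_experts; training against it spreads the load back out.
```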

Phase 4: Design a Small MoE — and Predict Its Routing

Design a small MoE and predict its routing patterns (an example design spec follows the drop)

1 drop
  1. Sketch a 4-expert MoE for your own task and predict where tokens land (8 min)
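
Purely as an illustration of what the exercise asks for, here is one way to write the design down so the predictions are checkable later. The task, expert labels, and routing guesses are hypothetical placeholders, not recommendations.

```python
# Illustrative only: a design spec for the Phase 4 exercise. Every value here
# is a placeholder; the point is to commit to guesses you can verify against
# real routing statistics later.
moe_design = {
    "task": "answering mixed code-and-prose questions",  # hypothetical task
    "num_experts": 4,
    "top_k": 2,
    "experts": ["E0", "E1", "E2", "E3"],                  # each is just an FFN block
    # Predictions to check later. Recall Phase 3: specialization often follows
    # surface features (punctuation, digits, whitespace) more than topic.
    "predicted_routing": {
        "code tokens (brackets, operators)": "E0",
        "numbers and units": "E1",
        "common function words": "E2",
        "rare domain nouns": "E3",
    },
}

for token_type, expert in moe_design["predicted_routing"].items():
    print(f"{token_type:36s} -> {expert}")
```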

Frequently asked questions

What is sparse activation in mixture-of-experts models?
Sparse activation means that for any given token, only the experts the router selects actually run their feed-forward computation; the rest sit idle. Phase 1 builds this up from why dense models pay for parameters they barely use.
How does the router decide which experts to send a token to?
The router is a single linear layer: one matmul maps the token's hidden state to one logit per expert, and the highest-scoring experts (top-2 in this path) receive the token. Phase 2 walks one token through the whole computation by hand.
Why does top-2 routing beat top-1 in practice?
Sending each token to two experts lets the layer blend two specialists and keeps a training signal flowing to the runner-up, which makes routing less brittle than betting everything on a single expert. Phase 2's routing drops work through the trade-off.
What is expert collapse and how do load balancing losses fix it?
Expert collapse happens when the router funnels nearly every token to one expert, leaving the others untrained. Load-balancing auxiliary losses don't pick experts; they penalize uneven assignment so the router is nudged toward spreading tokens out. Phase 3 covers both the failure and the fix.
Why is MoE cheap at inference but tricky to train?
At inference only the selected experts run, so per-token compute stays close to that of a much smaller dense model despite the large total parameter count. Training is harder: every expert still has to live in memory, routing has to stay balanced, and the auxiliary losses add tuning overhead. Phase 3 unpacks this trade-off.