
🧠 Understand Mixture-of-Experts (MoE) Architectures

Stop hearing 'experts vote' and start watching a single token route through a sparse layer — by the end you'll predict which inputs land on which expert in a small MoE you design yourself.

Advanced · 14 drops · ~2-week path · 5–8 min/day · Technology

Phase 1: Why Sparsity — and What MoE Actually Activates

See why sparsity beats dense scaling for parameter budgets (a short code sketch follows the drop list)

4 drops
  1. Dense models pay for parameters they barely use (6 min)
  2. An expert is just a feed-forward block — nothing more (6 min)
  3. The router is a one-line linear layer that picks experts (7 min)
  4. Most experts sit idle on every token — that's the feature (6 min)
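
Phase 1 boils down to two small pieces of code. The sketch below (plain numpy, illustrative sizes, not the course's own code) shows that an expert is just a feed-forward block and the router is a single linear layer; with top-1 routing, only one of the four expert blocks runs for a given token.

```python
# A minimal sketch: an "expert" is just a feed-forward block, and the router
# is a single linear layer. Names and shapes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden, n_experts = 8, 32, 4

def feed_forward(x, w_in, w_out):
    """One expert: a plain two-layer MLP with a ReLU in between."""
    return np.maximum(x @ w_in, 0.0) @ w_out

# Four experts = four independent feed-forward blocks.
experts = [(rng.normal(size=(d_model, d_hidden)) * 0.1,
            rng.normal(size=(d_hidden, d_model)) * 0.1) for _ in range(n_experts)]

# The router really is one linear layer: d_model inputs -> n_experts logits.
w_router = rng.normal(size=(d_model, n_experts)) * 0.1

x = rng.normal(size=(d_model,))   # one token's hidden state
logits = x @ w_router             # four numbers, one per expert
print("router logits:", logits)

# Sparse activation: with top-1 routing only 1 of the 4 expert FFNs runs,
# so roughly a quarter of the expert parameters are touched for this token.
best = int(np.argmax(logits))
y = feed_forward(x, *experts[best])
print("chosen expert:", best, "output shape:", y.shape)
```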

Phase 2: Walk One Token Through a 4-Expert MoE

Walk one token through a 4-expert layer by hand (a step-by-step code sketch follows the drops)

5 drops
  1. Pick a token, name your four experts, write down the gates (6 min)
  2. Compute gating logits — one matmul, four numbers (7 min)
  3. Top-2 routing throws away most of what the router said (7 min)
  4. Only the chosen experts compute — everyone else is silent (7 min)
  5. Combine expert outputs — weighted sum, then move on (7 min)
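
The same walk-through can be written out in a few lines. This numpy sketch follows Phase 2's steps for one token: gating logits from one matmul, softmax, keep the top-2, run only those experts, then take the weighted sum. Renormalizing the two gate values over the selected pair is one common convention, assumed here rather than taken from the course.

```python
# A minimal numpy sketch of the single-token walk-through.
# All names and sizes are illustrative, not the course's exact notation.
import numpy as np

rng = np.random.default_rng(1)
d_model, d_hidden, n_experts, top_k = 8, 32, 4, 2

experts = [(rng.normal(size=(d_model, d_hidden)) * 0.1,
            rng.normal(size=(d_hidden, d_model)) * 0.1) for _ in range(n_experts)]
w_router = rng.normal(size=(d_model, n_experts)) * 0.1
x = rng.normal(size=(d_model,))                 # one token's hidden state

# 1) Gating logits: one matmul, four numbers.
logits = x @ w_router

# 2) Softmax over all experts, then keep only the top-2 scores.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
top = np.argsort(probs)[-top_k:]                # indices of the 2 winners
gates = probs[top] / probs[top].sum()           # renormalize over the pair

# 3) Only the chosen experts compute; the other two stay silent.
outputs = [np.maximum(x @ experts[i][0], 0.0) @ experts[i][1] for i in top]

# 4) Combine: weighted sum of the two expert outputs, then move on.
y = sum(g * o for g, o in zip(gates, outputs))
print("winners:", top, "gates:", np.round(gates, 3), "output shape:", y.shape)
```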

Phase 3: Load Balancing, Expert Collapse, and Why Inference Wins

Learn how load balancing prevents expert collapse and why inference wins (an auxiliary-loss sketch follows the drops)

4 drops
  1. Without an auxiliary loss, all your tokens go to one expert (7 min)
  2. Auxiliary losses don't pick experts — they push for fairness (7 min)
  3. MoE wins at inference and bleeds at training (7 min)
  4. Experts specialize by surface features more than by topic (7 min)
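
For the auxiliary loss, here is one common formulation (a Switch Transformer-style balance loss), sketched in numpy as an assumption about what Phase 3 covers rather than a quote from it: it compares the fraction of tokens each expert actually receives with the router's average probability for that expert, and it is minimized when both are uniform.

```python
# One common load-balancing auxiliary loss. It never picks experts itself;
# it only nudges the router toward spreading tokens evenly.
import numpy as np

rng = np.random.default_rng(2)
n_tokens, d_model, n_experts = 16, 8, 4

w_router = rng.normal(size=(d_model, n_experts)) * 0.1
X = rng.normal(size=(n_tokens, d_model))        # a small batch of tokens

logits = X @ w_router
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)

assign = probs.argmax(axis=-1)                  # top-1 assignment per token
f = np.bincount(assign, minlength=n_experts) / n_tokens   # fraction of tokens per expert
P = probs.mean(axis=0)                          # mean router probability per expert

aux_loss = n_experts * np.sum(f * P)            # equals 1.0 when perfectly balanced
print("tokens per expert:", np.bincount(assign, minlength=n_experts))
print("aux loss:", round(float(aux_loss), 3))
# If every token collapses onto one expert, f and P concentrate there and the
# loss rises toward n_experts; training against it spreads the load back out.
```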

Phase 4: Design a Small MoE — and Predict Its Routing

Design a small MoE and predict its routing patterns (an example design spec follows the drop)

1 drop
  1. Sketch a 4-expert MoE for your own task and predict where tokens land (8 min)
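
Purely as an illustration of what the exercise asks for, here is one way to write the design down so the predictions are checkable later. The task, expert labels, and routing guesses are hypothetical placeholders, not recommendations.

```python
# Illustrative only: a design spec for the Phase 4 exercise. Every value here
# is a placeholder; the point is to commit to guesses you can verify against
# real routing statistics later.
moe_design = {
    "task": "answering mixed code-and-prose questions",  # hypothetical task
    "num_experts": 4,
    "top_k": 2,
    "experts": ["E0", "E1", "E2", "E3"],                  # each is just an FFN block
    # Predictions to check later. Recall Phase 3: specialization often follows
    # surface features (punctuation, digits, whitespace) more than topic.
    "predicted_routing": {
        "code tokens (brackets, operators)": "E0",
        "numbers and units": "E1",
        "common function words": "E2",
        "rare domain nouns": "E3",
    },
}

for token_type, expert in moe_design["predicted_routing"].items():
    print(f"{token_type:36s} -> {expert}")
```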

Frequently asked questions

What is sparse activation in mixture-of-experts models?
Sparse activation means that for any given token, only the experts the router selects actually run their feed-forward computation; the rest sit idle. Phase 1 builds this up from why dense models pay for parameters they barely use.
How does the router decide which experts to send a token to?
The router is a single linear layer: one matmul maps the token's hidden state to one logit per expert, and the highest-scoring experts (top-2 in this path) receive the token. Phase 2 walks one token through the whole computation by hand.
Why does top-2 routing beat top-1 in practice?
Sending each token to two experts lets the layer blend two specialists and keeps a training signal flowing to the runner-up, which makes routing less brittle than betting everything on a single expert. Phase 2's routing drops work through the trade-off.
What is expert collapse and how do load balancing losses fix it?
Expert collapse happens when the router funnels nearly every token to one expert, leaving the others untrained. Load-balancing auxiliary losses don't pick experts; they penalize uneven assignment so the router is nudged toward spreading tokens out. Phase 3 covers both the failure and the fix.
Why is MoE cheap at inference but tricky to train?
At inference only the selected experts run, so per-token compute stays close to that of a much smaller dense model despite the large total parameter count. Training is harder: every expert still has to live in memory, routing has to stay balanced, and the auxiliary losses add tuning overhead. Phase 3 unpacks this trade-off.