Mixture of Experts
Sparse activation — the next axis of scale
4 lessons
Lessons
- 01 MoE Fundamentals (Medium)
  Why sparse activation lets you scale parameters without scaling FLOPs.
- 02 Top-k Routing (Hard)
  The gating network that picks which experts see each token.
- 03 Load Balancing Loss (Hard)
  Preventing expert collapse — keep every expert busy.
- 04 Expert Parallelism (Hard)
  Distributing experts across GPUs at training time.
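
Lesson sketches

The back-of-envelope arithmetic below (my own illustrative sizes, not the course's numbers) shows the trade-off lesson 01 is about: stored parameters grow with the number of experts, while per-token FLOPs grow only with the number of experts a token actually activates.

```python
# Illustrative sizes only (assumed, not from the course).
d_model, d_ff = 4096, 16384
num_experts, top_k = 64, 2

params_per_expert = 2 * d_model * d_ff          # two weight matrices per FFN expert
total_params = num_experts * params_per_expert  # what the model stores
active_params = top_k * params_per_expert       # what each token actually touches

# Roughly 2 FLOPs per active weight for a forward pass through the FFN.
dense_flops_per_token = 2 * params_per_expert
moe_flops_per_token = 2 * active_params

print(f"stored parameters : {total_params / 1e9:.2f} B")
print(f"active per token  : {active_params / 1e9:.2f} B")
print(f"FLOPs vs. one dense FFN: {moe_flops_per_token / dense_flops_per_token:.0f}x")
```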
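
For lesson 02, a minimal top-k gating sketch (my own PyTorch illustration; class and parameter names are assumptions, not the course's code): a linear layer scores every expert for each token, softmax turns the scores into probabilities, and only the top_k experts are kept, with their weights renormalised. Each token is then sent only to those experts, and their outputs are combined with the renormalised weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Gating network: picks which experts see each token (illustrative)."""
    def __init__(self, d_model=512, num_experts=8, top_k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):                                       # x: [num_tokens, d_model]
        probs = F.softmax(self.gate(x), dim=-1)                 # [num_tokens, num_experts]
        weights, expert_idx = probs.topk(self.top_k, dim=-1)    # keep only the k best experts
        weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalise over the k picked
        return weights, expert_idx, probs                       # probs are reused by the aux loss

router = TopKRouter()
tokens = torch.randn(6, 512)
weights, expert_idx, probs = router(tokens)
print(expert_idx)   # the 2 experts that will process each of the 6 tokens
```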
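
For lesson 03, one common auxiliary loss (a Switch-Transformer-style formulation generalised to top-k; the course may use a different variant): it is minimised when the fraction of tokens assigned to each expert and the mean router probability per expert are both uniform, which is what stops a few experts from capturing all the traffic while the rest collapse into dead weight.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_probs, expert_idx, num_experts):
    # router_probs: [tokens, num_experts] softmax output of the gate
    # expert_idx:   [tokens, top_k] experts actually chosen per token
    top_k = expert_idx.shape[1]
    assignments = F.one_hot(expert_idx, num_experts).sum(dim=1).float()  # [tokens, num_experts]
    tokens_per_expert = assignments.mean(dim=0) / top_k   # f_e: fraction of traffic per expert
    mean_router_prob = router_probs.mean(dim=0)           # P_e: mean gate probability per expert
    # num_experts * sum(f_e * P_e) is ~1.0 when balanced, ~num_experts on full collapse.
    return num_experts * torch.sum(tokens_per_expert * mean_router_prob)

# Balanced routing -> loss near 1.0.
probs = torch.full((1024, 8), 1 / 8)
idx = torch.randint(0, 8, (1024, 2))
print(load_balancing_loss(probs, idx, 8))

# Collapsed routing (everything to expert 0) -> loss near num_experts.
probs_collapsed = F.one_hot(torch.zeros(1024, dtype=torch.long), 8).float()
idx_collapsed = torch.zeros(1024, 2, dtype=torch.long)
print(load_balancing_loss(probs_collapsed, idx_collapsed, 8))
```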
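
For lesson 04, a sketch of the dispatch bookkeeping behind expert parallelism (my own illustration; function and variable names are assumptions): when experts are sharded across GPUs, each rank works out which rank owns every token's chosen expert, groups token copies by destination, and exchanges them with an all-to-all before the experts run.

```python
import torch
import torch.distributed as dist   # only used in the commented-out exchange below

def dispatch_plan(expert_idx, num_experts, world_size):
    # expert_idx: flat [num_assignments] tensor of chosen expert ids on this rank.
    experts_per_rank = num_experts // world_size
    dest_rank = expert_idx // experts_per_rank               # rank that hosts each chosen expert
    order = torch.argsort(dest_rank)                         # group token copies by destination
    input_splits = torch.bincount(dest_rank, minlength=world_size).tolist()
    return order, input_splits

# Example: 8 experts sharded over 4 ranks (experts 0-1 on rank 0, 2-3 on rank 1, ...).
expert_idx = torch.randint(0, 8, (16,))
order, input_splits = dispatch_plan(expert_idx, num_experts=8, world_size=4)
print(input_splits)   # how many of this rank's token copies go to each peer rank

# In a real multi-GPU run (after dist.init_process_group), the exchange would look like:
# dist.all_to_all_single(recv_buffer, send_buffer[order],
#                        output_split_sizes=output_splits,   # counts received from peers first
#                        input_split_sizes=input_splits)
```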