Mixture of Experts
Sparse activation — the next axis of scale
4 lessons
Lessons
- 01 MoE Fundamentals (Medium)
  Why sparse activation lets you scale parameters without scaling FLOPs.
- 02 Top-k Routing (Hard)
  The gating network that picks which experts see each token.
- 03 Load Balancing Loss (Hard)
  Preventing expert collapse — keep every expert busy.
- 04 Expert Parallelism (Hard)
  Distributing experts across GPUs at training time.
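
Lesson sketches

The back-of-envelope arithmetic below (my own illustrative sizes, not the course's numbers) shows the trade-off lesson 01 is about: stored parameters grow with the number of experts, while per-token FLOPs grow only with the number of experts a token actually activates.

```python
# Illustrative sizes only (assumed, not from the course).
d_model, d_ff = 4096, 16384
num_experts, top_k = 64, 2

params_per_expert = 2 * d_model * d_ff          # two weight matrices per FFN expert
total_params = num_experts * params_per_expert  # what the model stores
active_params = top_k * params_per_expert       # what each token actually touches

# Roughly 2 FLOPs per active weight for a forward pass through the FFN.
dense_flops_per_token = 2 * params_per_expert
moe_flops_per_token = 2 * active_params

print(f"stored parameters : {total_params / 1e9:.2f} B")
print(f"active per token  : {active_params / 1e9:.2f} B")
print(f"FLOPs vs. one dense FFN: {moe_flops_per_token / dense_flops_per_token:.0f}x")
```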
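
For lesson 02, a minimal top-k gating sketch (my own PyTorch illustration; class and parameter names are assumptions, not the course's code): a linear layer scores every expert for each token, softmax turns the scores into probabilities, and only the top_k experts are kept, with their weights renormalised. Each token is then sent only to those experts, and their outputs are combined with the renormalised weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Gating network: picks which experts see each token (illustrative)."""
    def __init__(self, d_model=512, num_experts=8, top_k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):                                       # x: [num_tokens, d_model]
        probs = F.softmax(self.gate(x), dim=-1)                 # [num_tokens, num_experts]
        weights, expert_idx = probs.topk(self.top_k, dim=-1)    # keep only the k best experts
        weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalise over the k picked
        return weights, expert_idx, probs                       # probs are reused by the aux loss

router = TopKRouter()
tokens = torch.randn(6, 512)
weights, expert_idx, probs = router(tokens)
print(expert_idx)   # the 2 experts that will process each of the 6 tokens
```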
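
For lesson 03, one common auxiliary loss (a Switch-Transformer-style formulation generalised to top-k; the course may use a different variant): it is minimised when the fraction of tokens assigned to each expert and the mean router probability per expert are both uniform, which is what stops a few experts from capturing all the traffic while the rest collapse into dead weight.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_probs, expert_idx, num_experts):
    # router_probs: [tokens, num_experts] softmax output of the gate
    # expert_idx:   [tokens, top_k] experts actually chosen per token
    top_k = expert_idx.shape[1]
    assignments = F.one_hot(expert_idx, num_experts).sum(dim=1).float()  # [tokens, num_experts]
    tokens_per_expert = assignments.mean(dim=0) / top_k   # f_e: fraction of traffic per expert
    mean_router_prob = router_probs.mean(dim=0)           # P_e: mean gate probability per expert
    # num_experts * sum(f_e * P_e) is ~1.0 when balanced, ~num_experts on full collapse.
    return num_experts * torch.sum(tokens_per_expert * mean_router_prob)

# Balanced routing -> loss near 1.0.
probs = torch.full((1024, 8), 1 / 8)
idx = torch.randint(0, 8, (1024, 2))
print(load_balancing_loss(probs, idx, 8))

# Collapsed routing (everything to expert 0) -> loss near num_experts.
probs_collapsed = F.one_hot(torch.zeros(1024, dtype=torch.long), 8).float()
idx_collapsed = torch.zeros(1024, 2, dtype=torch.long)
print(load_balancing_loss(probs_collapsed, idx_collapsed, 8))
```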
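
For lesson 04, a sketch of the dispatch bookkeeping behind expert parallelism (my own illustration; function and variable names are assumptions): when experts are sharded across GPUs, each rank works out which rank owns every token's chosen expert, groups token copies by destination, and exchanges them with an all-to-all before the experts run.

```python
import torch
import torch.distributed as dist   # only used in the commented-out exchange below

def dispatch_plan(expert_idx, num_experts, world_size):
    # expert_idx: flat [num_assignments] tensor of chosen expert ids on this rank.
    experts_per_rank = num_experts // world_size
    dest_rank = expert_idx // experts_per_rank               # rank that hosts each chosen expert
    order = torch.argsort(dest_rank)                         # group token copies by destination
    input_splits = torch.bincount(dest_rank, minlength=world_size).tolist()
    return order, input_splits

# Example: 8 experts sharded over 4 ranks (experts 0-1 on rank 0, 2-3 on rank 1, ...).
expert_idx = torch.randint(0, 8, (16,))
order, input_splits = dispatch_plan(expert_idx, num_experts=8, world_size=4)
print(input_splits)   # how many of this rank's token copies go to each peer rank

# In a real multi-GPU run (after dist.init_process_group), the exchange would look like:
# dist.all_to_all_single(recv_buffer, send_buffer[order],
#                        output_split_sizes=output_splits,   # counts received from peers first
#                        input_split_sizes=input_splits)
```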