Mixture of Experts

Sparse activation: the next axis of scale

Prof. Ramesh Kumar, PES University · 4 lessons

Lessons

1. MoE Fundamentals (Medium)
   Why sparse activation lets you scale parameters without scaling FLOPs; see the first sketch after this list.

2. Top-k Routing (Hard)
   The gating network that picks which experts see each token; see the second sketch after this list.

3. Load Balancing Loss (Hard)
   Preventing expert collapse: keep every expert busy. See the third sketch after this list.

4. Expert Parallelism (Hard)
   Distributing experts across GPUs at training time; see the fourth sketch after this list.
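
The first lesson's headline claim can be checked with back-of-envelope arithmetic: adding experts multiplies parameters, while per-token compute grows only with the number of experts a token actually visits. A minimal sketch; all sizes here (d_model, d_ff, num_experts, top_k) are hypothetical, and each expert is assumed to be a standard two-matrix FFN at roughly 2 FLOPs per weight per token:

```python
# Back-of-envelope scaling: parameters grow with the expert count,
# but per-token compute grows only with k, the experts active per token.
d_model, d_ff = 1024, 4096           # hypothetical transformer sizes
num_experts, top_k = 64, 2           # hypothetical MoE configuration

ffn_params = 2 * d_model * d_ff      # one dense FFN (two weight matrices)
ffn_flops = 2 * ffn_params           # ~2 FLOPs per weight per token

moe_params = num_experts * ffn_params    # every expert stores its own weights
moe_flops = top_k * ffn_flops            # but only k experts run per token

print(f"params:      dense {ffn_params:.2e}  moe {moe_params:.2e}  ({num_experts}x)")
print(f"flops/token: dense {ffn_flops:.2e}  moe {moe_flops:.2e}  ({top_k}x)")
```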
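
A minimal top-k gate for the second lesson, sketched in PyTorch: softmax the router logits, keep the k largest gates per token, renormalise them, and run each expert only on the tokens routed to it. The names (w_gate) and all shapes are illustrative assumptions, not the course's reference implementation:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
tokens, d_model = 8, 16              # hypothetical batch of 8 tokens
num_experts, top_k = 4, 2

x = torch.randn(tokens, d_model)
w_gate = torch.randn(d_model, num_experts) * 0.02   # router weights (assumed name)

logits = x @ w_gate                                  # (tokens, experts)
probs = F.softmax(logits, dim=-1)
gate_vals, expert_idx = probs.topk(top_k, dim=-1)    # pick k experts per token
gate_vals = gate_vals / gate_vals.sum(-1, keepdim=True)  # renormalise over the kept k

# Stand-in experts; real ones are full FFN blocks.
experts = [torch.nn.Linear(d_model, d_model) for _ in range(num_experts)]

# Dispatch: each expert sees only its tokens; outputs are gate-weighted and summed.
out = torch.zeros_like(x)
for e in range(num_experts):
    mask = (expert_idx == e)                 # (tokens, top_k) hits for expert e
    token_ids, slot = mask.nonzero(as_tuple=True)
    if token_ids.numel():
        out[token_ids] += gate_vals[token_ids, slot, None] * experts[e](x[token_ids])
```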
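
For the third lesson, one widely used auxiliary loss (the Switch Transformer formulation) multiplies each expert's dispatched-token fraction f_e by its mean router probability P_e and sums over experts; the sum is minimised exactly when routing is uniform. A sketch with hypothetical shapes:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
tokens, num_experts = 32, 4
logits = torch.randn(tokens, num_experts, requires_grad=True)

probs = F.softmax(logits, dim=-1)                 # router probabilities
top1 = probs.argmax(dim=-1)                       # hard top-1 assignment

# f_e: fraction of tokens dispatched to expert e (hard count, no gradient)
f = F.one_hot(top1, num_experts).float().mean(0)
# P_e: mean router probability for expert e (the differentiable path)
P = probs.mean(0)

# Switch-style auxiliary loss: minimised when both f and P are uniform (1/E each).
aux_loss = num_experts * torch.sum(f * P)
aux_loss.backward()                               # gradients flow through P only
print(f"aux loss = {aux_loss.item():.4f} (uniform optimum = 1.0)")
```

Only P carries a gradient here, since f comes from a hard argmax; the loss therefore nudges the soft router probabilities toward uniform, which in turn evens out the hard assignments.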
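
The fourth lesson concerns placement: with, say, one expert per GPU, each token must travel to the rank holding its expert and back, typically via two all-to-all collectives. The sketch below simulates that exchange in a single process, with Python lists standing in for ranks; in a real job, steps 1 and 3 would each be one collective such as torch.distributed.all_to_all_single:

```python
import torch

torch.manual_seed(0)
world_size = 4                        # hypothetical: one expert per GPU rank
tokens_per_rank, d_model = 8, 16

# Each "rank" holds its own tokens plus a routing decision per token.
local_tokens = [torch.randn(tokens_per_rank, d_model) for _ in range(world_size)]
dest_rank = [torch.randint(0, world_size, (tokens_per_rank,)) for _ in range(world_size)]

# Step 1 (dispatch): every rank sends each token to the rank owning its expert.
# In a real run this permutation is a single all-to-all collective.
inbox = [[] for _ in range(world_size)]
for src in range(world_size):
    for dst in range(world_size):
        inbox[dst].append(local_tokens[src][dest_rank[src] == dst])

# Step 2 (expert compute): each rank runs only its resident expert.
experts = [torch.nn.Linear(d_model, d_model) for _ in range(world_size)]
outbox = [experts[r](torch.cat(inbox[r])) for r in range(world_size)]

# Step 3 (combine): the inverse all-to-all would return results to their sources.
for r in range(world_size):
    print(f"rank {r}: processed {outbox[r].shape[0]} tokens on its expert")
```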