Path-Constrained Mixture-of-Experts

Zijin Gu, Tatiana Likhomanenko, Vimal Thilak, Jason Ramapuram, Navdeep Jaitly

Abstract

Sparse Mixture-of-Experts (MoE) architectures enable efficient scaling by activating only a subset of parameters for each input. However, conventional MoE routing selects each layer's experts independently, creating $N^L$ possible expert paths for $N$ experts across $L$ layers. This number far exceeds typical training set sizes, leading to statistical inefficiency, as the model may not learn meaningful structure over such a vast path space. To constrain the path space, we propose PathMoE, which shares router parameters across consecutive layers. Experiments on 0.9B and 16B parameter models demonstrate consistent improvements in perplexity and downstream task performance over independent routing, while eliminating the need for auxiliary load balancing losses. Analysis reveals that tokens following the same path naturally cluster by linguistic function, with PathMoE producing more concentrated groups, better cross-layer consistency, and greater robustness to routing perturbations. These results offer a new perspective for understanding MoE architectures through the lens of expert paths.
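
To give a sense of the scale behind the $N^L$ claim, the following is a small arithmetic sketch. The values chosen for $N$, $L$, and the training-token count are illustrative assumptions, not settings reported in the paper, and the helper name `num_expert_paths` is hypothetical.

```python
# Illustrative arithmetic only: N, L, and TRAINING_TOKENS are assumed values,
# not settings reported in the paper.

def num_expert_paths(num_experts: int, num_layers: int) -> int:
    """Count paths when each of L layers independently picks one of N experts."""
    return num_experts ** num_layers

N = 16                      # experts per layer (assumed)
L = 24                      # MoE layers (assumed)
TRAINING_TOKENS = 10 ** 12  # one trillion training tokens (assumed)

paths = num_expert_paths(N, L)
print(f"possible paths: {paths:.3e}")
print(f"paths per training token: {paths / TRAINING_TOKENS:.3e}")
```

Under these assumed values the path count is $16^{24} = 2^{96}$, many orders of magnitude larger than the assumed number of training tokens, so most paths could never be observed even once during training.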

Paper Structure (49 sections, 14 equations, 8 figures, 9 tables)

This paper contains 49 sections, 14 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Statistical inefficiency of independent routing.
  • Figure 2: Spectrum of routing constraints in MoE architectures. (a) Independent routing: each layer has its own router $r_i$, creating $N^L$ possible paths for $N$ experts and $L$ layers. (b) Block-wise parameter-sharing routing, dubbed PathMoE: layers within a block share router parameters. (c) Fully decision-shared routing: all layers share one router and its decision.
  • Figure 3: Training dynamics comparing PathB4-MoE and Indep-MoE with and without load balancing losses. Curves are smoothed with exponential moving average (weight=0.6) for clarity.
  • Figure 4: Cumulative token coverage as a function of the number of paths.
  • Figure 5: Token specialization of representative expert paths. Each word cloud shows the most frequent tokens processed by a path specialized for that linguistic category.
  • ...and 3 more figures
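
The block-wise sharing described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the top-1 selection rule, the random initialization, and the block size of 4 are all assumptions made for the example.

```python
import random

def make_router(num_experts, hidden_dim, rng):
    """A router here is just an N x d weight matrix, stored as a list of rows."""
    return [[rng.gauss(0.0, 1.0) for _ in range(hidden_dim)]
            for _ in range(num_experts)]

def build_routers(num_layers, block_size, num_experts, hidden_dim, seed=0):
    """Return one router per layer; layers in the same block share one object."""
    rng = random.Random(seed)
    routers = []
    shared = None
    for layer in range(num_layers):
        if layer % block_size == 0:
            # Start of a new block: allocate fresh router parameters.
            shared = make_router(num_experts, hidden_dim, rng)
        routers.append(shared)
    return routers

def route_top1(router, hidden_state):
    """Pick the expert whose router row has the largest dot product (assumed rule)."""
    logits = [sum(w * h for w, h in zip(row, hidden_state)) for row in router]
    return max(range(len(logits)), key=logits.__getitem__)

# Hypothetical configuration: 8 layers in blocks of 4 gives 2 distinct routers.
routers = build_routers(num_layers=8, block_size=4, num_experts=4, hidden_dim=16)
print(len({id(r) for r in routers}))
```

Setting `block_size=1` in this sketch gives every layer its own router, which corresponds to independent routing in panel (a). Sharing parameters is not the same as sharing decisions: a token's hidden state changes from layer to layer, so layers in one block can still select different experts, whereas panel (c) shares the decision itself.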