Grassmannian Mixture-of-Experts: Concentration-Controlled Routing on Subspace Manifolds

Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma

TL;DR

GrMoE is a routing framework on the Grassmannian manifold of subspaces in which gating weights arise from the concentration parameters of Matrix Bingham distributions. It establishes the first formal theory of concentration-controlled sparsity.

Abstract

Mixture-of-Experts models rely on learned routers to assign tokens to experts, yet standard softmax gating provides no principled mechanism to control the tradeoff between sparsity and utilization. We propose Grassmannian MoE (GrMoE), a routing framework that operates on the Grassmannian manifold of subspaces, where gating weights arise from the concentration parameters of Matrix Bingham distributions. This construction yields a single, interpretable knob, the concentration matrix $\Lambda$, that continuously controls routing entropy, replacing discrete top-$k$ selection with a smooth, geometrically principled sparsity mechanism. We further develop an amortized variational inference procedure for posterior routing distributions, enabling uncertainty-aware expert assignment that naturally resists expert collapse. We prove tight bounds relating the Bingham concentration spectrum to routing entropy and expected top-$k$ mass, together with an exponential bound on expert collapse, establishing the first formal theory of concentration-controlled sparsity. Across synthetic routing tasks and MoE language models with 350M parameters (8 experts), 1.3B parameters (16 experts), and 2.7B parameters (32 experts), GrMoE achieves 0% routing collapse across all seeds, comparable or better perplexity with 15-30% improved load balance, and a smooth monotonic relationship between concentration and effective sparsity that enables post-hoc sparsity tuning without retraining. Token-level analysis reveals that experts learn heterogeneous concentration values that correlate with linguistic specialization, providing interpretable routing behavior.
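
To make the "single knob" idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the gating weights take the form of a softmax over per-expert concentration scores scaled by a scalar $\alpha$; the names `gate`, `entropy`, and the example `kappa` values are hypothetical, and the paper's actual routing rule is not reproduced on this page. Under that assumption, raising $\alpha$ lowers the routing entropy smoothly instead of switching between discrete top-$k$ settings.

```python
import math

def gate(kappa, alpha):
    """Softmax over alpha-scaled concentration scores (assumed gating form)."""
    m = max(alpha * k for k in kappa)  # subtract the max for numerical stability
    w = [math.exp(alpha * k - m) for k in kappa]
    z = sum(w)
    return [x / z for x in w]

def entropy(g):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in g if p > 0.0)

# Hypothetical per-expert concentration scores for one token (8 experts).
kappa = [2.0, 1.5, 1.0, 0.5, 0.0, -0.5, -1.0, -1.5]

alphas = [0.0, 0.5, 1.0, 2.0, 4.0, 8.0]
entropies = [entropy(gate(kappa, a)) for a in alphas]
for a, h in zip(alphas, entropies):
    # exp(H) is the "effective number of experts" receiving weight.
    print(f"alpha={a:4.1f}  H={h:.4f}  effective experts={math.exp(h):.2f}")
```

At $\alpha = 0$ the sketch routes uniformly (entropy $\log 8$), and each larger $\alpha$ concentrates weight on fewer experts.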

Paper Structure (43 sections, 3 theorems, 23 equations, 4 figures, 6 tables, 1 algorithm)

Key Result

Theorem 1

Let $g^{(\alpha)}(x)$ be the GrMoE routing distribution (Eq. eq:alpha_sparsity) with $N$ experts. Let $H(\alpha, x) = -\sum_e g_e^{(\alpha)}(x) \log g_e^{(\alpha)}(x)$ be the routing entropy for token $x$. Then $H(\alpha, x)$ is strictly bounded by the concentration gap $\Delta_\kappa(x) = \max_e \k$... Furthermore, the entropy reduction is exactly bounded from above globally by the concentration vari...
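
As a point of reference only, and not a restatement of Theorem 1, the worked identity below shows why entropy can be expected to fall as concentration grows. It assumes a gating form $g_e^{(\alpha)}(x) \propto \exp(\alpha \kappa_e(x))$, which is an illustrative assumption because the referenced routing equation is not shown here. It uses the standard facts $\partial_\alpha \log Z = \mathbb{E}[\kappa]$ and $\partial_\alpha \mathbb{E}[\kappa] = \mathrm{Var}[\kappa]$ for exponential-family weights.

```latex
% Assumption (not taken from the paper):
%   g_e^{(\alpha)}(x) = \exp(\alpha \kappa_e(x)) / Z(\alpha, x),
%   Z(\alpha, x) = \sum_{e=1}^{N} \exp(\alpha \kappa_e(x)).
\begin{align*}
H(\alpha, x)
  &= -\sum_e g_e^{(\alpha)}(x)\,\bigl(\alpha \kappa_e(x) - \log Z(\alpha, x)\bigr)
   = \log Z(\alpha, x) - \alpha\,\mathbb{E}_{g^{(\alpha)}}[\kappa(x)], \\
\frac{\partial H}{\partial \alpha}
  &= \mathbb{E}_{g^{(\alpha)}}[\kappa(x)]
     - \mathbb{E}_{g^{(\alpha)}}[\kappa(x)]
     - \alpha\,\mathrm{Var}_{g^{(\alpha)}}[\kappa(x)]
   = -\alpha\,\mathrm{Var}_{g^{(\alpha)}}[\kappa(x)] \;\le\; 0
   \qquad (\alpha \ge 0), \\
0 &\le H(\alpha, x) \le \log N .
\end{align*}
```

Under this assumption the entropy is non-increasing in $\alpha$, and it decreases strictly whenever the scores $\kappa_e(x)$ are not all equal.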

Figures (4)

  • Figure 1: Post-hoc sparsity control via concentration scaling. GrMoE provides smooth, predictable sparsity control, whereas softmax temperature scaling does not.
  • Figure 2: Training dynamics comparison. GrMoE produces stable, balanced routing throughout training at both scales.
  • Figure 3: Token-level routing analysis on the 1.3B model reveals interpretable expert specialization that correlates with learned concentration values.
  • Figure 4: Pareto frontier of PPL vs. throughput. A single GrMoE model (varying $\alpha$) traces a smooth frontier that dominates separately trained softmax top-$k$ models at every operating point.
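
The post-hoc tuning described in the captions for Figures 1 and 4 can be illustrated with a small sketch. It assumes, hypothetically, that routing weights are a softmax of $\alpha$-scaled concentration scores; because entropy is then non-increasing in $\alpha$, a simple bisection can pick the $\alpha$ that yields a desired effective number of experts without touching any trained parameters. The function names, the search interval, and the `kappa` values are illustrative choices, not taken from the paper.

```python
import math

def routing_entropy(kappa, alpha):
    """Entropy (nats) of softmax(alpha * kappa), the assumed gating form."""
    m = max(alpha * k for k in kappa)  # stabilise the exponentials
    w = [math.exp(alpha * k - m) for k in kappa]
    z = sum(w)
    return -sum((x / z) * math.log(x / z) for x in w if x > 0.0)

def alpha_for_effective_experts(kappa, target, lo=0.0, hi=64.0, iters=60):
    """Bisect for the alpha at which exp(entropy) equals `target`.

    Relies on entropy being non-increasing in alpha for alpha >= 0.
    """
    goal = math.log(target)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if routing_entropy(kappa, mid) > goal:
            lo = mid  # routing still too dense: raise concentration
        else:
            hi = mid  # routing sparse enough: lower concentration
    return 0.5 * (lo + hi)

# Hypothetical per-expert concentration scores for one token (8 experts).
kappa = [2.0, 1.5, 1.0, 0.5, 0.0, -0.5, -1.0, -1.5]
alpha_star = alpha_for_effective_experts(kappa, target=2.0)
print(alpha_star, math.exp(routing_entropy(kappa, alpha_star)))
```

Sweeping `target` from 1 to the number of experts would trace one sparsity curve from a single set of scores, which is the kind of single-model frontier the Figure 4 caption refers to.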

Theorems & Definitions (6)

  • Theorem 1: Concentration--Entropy Bound
  • Corollary 1: Top-$k$ Mass Control
  • Theorem 2: Collapse Resistance via Subspace Separation
  • proof
  • proof
  • proof
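
For readers unfamiliar with the quantity named in Corollary 1, the sketch below computes top-$k$ mass, meaning the total gating weight carried by the $k$ most heavily weighted experts. It does not reproduce the corollary's bound. The softmax-of-scaled-scores gating, the function names, and the `kappa` values are assumptions made for illustration.

```python
import math

def gate(kappa, alpha):
    """Softmax over alpha-scaled concentration scores (assumed gating form)."""
    m = max(alpha * k for k in kappa)  # subtract the max for numerical stability
    w = [math.exp(alpha * k - m) for k in kappa]
    z = sum(w)
    return [x / z for x in w]

def top_k_mass(g, k):
    """Total gating weight carried by the k largest entries of g."""
    return sum(sorted(g, reverse=True)[:k])

# Hypothetical per-expert concentration scores for one token (8 experts).
kappa = [2.0, 1.5, 1.0, 0.5, 0.0, -0.5, -1.0, -1.5]

alphas = (0.0, 1.0, 2.0, 4.0)
masses = [top_k_mass(gate(kappa, a), 2) for a in alphas]
for a, mass in zip(alphas, masses):
    print(f"alpha={a:3.1f}  top-2 mass={mass:.4f}")
```

With uniform routing ($\alpha = 0$) the top-2 mass is exactly $2/8$. In this sketch it rises toward 1 as $\alpha$ grows, so hard top-$k$ selection discards progressively less weight.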