Grassmannian Mixture-of-Experts: Concentration-Controlled Routing on Subspace Manifolds

Ibne Farabi Shihab; Sanjeda Akter; Anuj Sharma

TL;DR

GrMoE is a routing framework that operates on the Grassmannian manifold of subspaces, with gating weights derived from the concentration parameters of Matrix Bingham distributions. It establishes the first formal theory of concentration-controlled sparsity.

Abstract

Mixture-of-Experts models rely on learned routers to assign tokens to experts, yet standard softmax gating provides no principled mechanism to control the tradeoff between sparsity and utilization. We propose Grassmannian MoE (GrMoE), a routing framework that operates on the Grassmannian manifold of subspaces, where gating weights arise from the concentration parameters of Matrix Bingham distributions. This construction yields a single, interpretable knob, the concentration matrix $Λ$, that continuously controls routing entropy, replacing discrete top-$k$ selection with a smooth, geometrically principled sparsity mechanism. We further develop an amortized variational inference procedure for posterior routing distributions, enabling uncertainty-aware expert assignment that naturally resists expert collapse. We formally prove tight bounds relating the Bingham concentration spectrum to routing entropy and expected top-$k$ mass, together with an exponential bound on expert collapse, establishing the first formal theory of concentration-controlled sparsity. We evaluate on synthetic routing tasks and on three MoE language models: 350M parameters with 8 experts, 1.3B parameters with 16 experts, and 2.7B parameters with 32 experts. Across these settings, GrMoE achieves 0% routing collapse across all seeds, comparable or better perplexity with 15 to 30% improved load balance, and a smooth monotonic relationship between concentration and effective sparsity that enables post-hoc sparsity tuning without retraining. Token-level analysis reveals that experts learn heterogeneous concentration values that correlate with linguistic specialization, providing interpretable routing behavior.
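The post-hoc sparsity tuning described in the abstract can be illustrated with a minimal sketch. The paper's gating equation is not reproduced in this summary, so the code assumes a simple stand-in: gate weights proportional to exp(alpha * kappa_e), where kappa_e is a per-expert concentration score and alpha is a scalar scale. The function names, the score values, and the gate form itself are hypothetical, not the authors' implementation.

```python
import math

def routing_weights(kappa, alpha):
    """ASSUMED gate form (not from the paper): g_e proportional to exp(alpha * kappa_e)."""
    scaled = [alpha * k for k in kappa]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [v / total for v in exps]

def top_k_mass(weights, k):
    """Total gate mass carried by the k largest weights."""
    return sum(sorted(weights, reverse=True)[:k])

# Hypothetical concentration scores for one token routed over 8 experts.
kappa = [2.0, 1.5, 0.3, 0.1, -0.2, -0.5, -1.0, -1.2]

# Sweep the scale at inference time; no parameters are retrained.
for alpha in (0.0, 0.5, 1.0, 2.0, 4.0):
    g = routing_weights(kappa, alpha)
    print(f"alpha={alpha:>3}: top-2 mass = {top_k_mass(g, 2):.3f}")
```

Under this assumed form, alpha = 0 gives uniform routing and larger alpha moves mass onto the highest-scoring experts, so the top-2 mass grows smoothly with alpha rather than jumping as a hard top-$k$ cutoff would.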

Paper Structure (43 sections, 3 theorems, 23 equations, 4 figures, 6 tables, 1 algorithm)


Introduction
Background and Notation
Mixture-of-Experts Routing
The Grassmannian Manifold
The Matrix Bingham Distribution
Method: Grassmannian MoE Routing
Expert Subspace Representation
Bingham Concentration Gating
Continuous Sparsity via Concentration
Connection to Softmax Gating
Theoretical Analysis: Concentration and Sparsity
Amortized Variational Routing
Routing as Exact Bayesian Inference
Amortized Posterior with Learned Uncertainty
Training Objective
...and 28 more sections

Key Result

Theorem 1

Let $g^{(\alpha)}(x)$ be the GrMoE routing distribution (Eq. eq:alpha_sparsity) with $N$ experts. Let $H(\alpha, x) = -\sum_e g_e^{(\alpha)}(x) \log g_e^{(\alpha)}(x)$ be the routing entropy for token $x$. Then $H(\alpha, x)$ is strictly bounded by the concentration gap $\Delta_\kappa(x) = \max_e \k$ [...] Furthermore, the entropy reduction is exactly bounded from above globally by the concentration vari [...]
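The routing entropy in the theorem statement can be computed directly from its definition. The sketch below implements $H = -\sum_e g_e \log g_e$ and checks the two general limits that hold for any distribution over $N$ experts: $H = \log N$ for uniform routing and $H = 0$ for a one-hot assignment. The gate used in the sweep, g_e proportional to exp(alpha * kappa_e), is an assumption made for illustration because Eq. eq:alpha_sparsity is not shown here; the scores in `kappa` are hypothetical, and the theorem's specific bounds are not reproduced.

```python
import math

def routing_entropy(g):
    """H = -sum_e g_e * log(g_e), with the convention 0 * log(0) = 0."""
    return -sum(p * math.log(p) for p in g if p > 0.0)

def softmax_gate(kappa, alpha):
    """ASSUMED gate form (not from the paper): g_e proportional to exp(alpha * kappa_e)."""
    scaled = [alpha * k for k in kappa]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [v / total for v in exps]

# General limits of the entropy for N = 4 experts.
print("uniform:", routing_entropy([0.25] * 4), "log N:", math.log(4))
print("one-hot:", routing_entropy([1.0, 0.0, 0.0, 0.0]))

# Hypothetical concentration scores for a single token.
kappa = [1.0, 0.4, 0.0, -0.6]
for alpha in (0.0, 1.0, 2.0, 4.0):
    h = routing_entropy(softmax_gate(kappa, alpha))
    # exp(H) is the perplexity of the gate, read here as an effective expert count.
    print(f"alpha={alpha}: H = {h:.4f}, effective experts = {math.exp(h):.2f}")
```

For this assumed gate, the derivative of the entropy with respect to alpha equals minus alpha times the variance of the scores under the gate, so the entropy cannot increase as alpha grows from zero. That is consistent with the monotonic concentration-sparsity relationship the paper reports, though it is a property of the stand-in gate and not a restatement of Theorem 1.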

Figures (4)

Figure 1: Post-hoc sparsity control via concentration scaling. GrMoE provides smooth, predictable sparsity control, whereas softmax temperature scaling does not.
Figure 2: Training dynamics comparison. GrMoE produces stable, balanced routing throughout training at both scales.
Figure 3: Token-level routing analysis on the 1.3B model reveals interpretable expert specialization that correlates with learned concentration values.
Figure 4: Pareto frontier of PPL vs. throughput. A single GrMoE model (varying $\alpha$) traces a smooth frontier that dominates separately trained softmax top-$k$ models at every operating point.

Theorems & Definitions (6)

Theorem 1: Concentration--Entropy Bound
Corollary 1: Top-$k$ Mass Control
Theorem 2: Collapse Resistance via Subspace Separation
Proofs (3)
