Sparsity and Superposition in Mixture of Experts

Marmik Chaudhari, Jeremi Nuer, Rome Thorstenson

TL;DR

This work addresses the mechanistic interpretability of mixtures of experts (MoEs) by examining how network sparsity and routing influence feature representation. By extending a toy framework to MoEs, it introduces measures of feature capacity and monosemanticity, showing that MoEs exhibit smoother transitions and reduced global interference compared to dense models, with greater monosemanticity as network sparsity increases. A key contribution is a feature-based notion of expert specialization, demonstrated when initialization guides experts to form monosemantic representations of coherent feature sets and to occupy distinct regions of input space. The findings suggest that network sparsity can enable more interpretable MoEs without sacrificing performance in controlled settings, challenging the idea that interpretability and capability are inherently at odds, while acknowledging the need to validate these patterns in large-scale, realistic transformers.

Abstract

Mixture of Experts (MoE) models have become central to scaling large language models, yet their mechanistic differences from dense networks remain poorly understood. Previous work has explored how dense models use superposition to represent more features than dimensions, and how superposition is a function of feature sparsity and feature importance. MoE models cannot be explained mechanistically through the same lens. We find that neither feature sparsity nor feature importance causes discontinuous phase changes, and that network sparsity (the ratio of active to total experts) better characterizes MoEs. We develop new metrics for measuring superposition across experts. Our findings demonstrate that models with greater network sparsity exhibit greater monosemanticity. We propose a new definition of expert specialization based on monosemantic feature representation rather than load balancing, showing that experts naturally organize around coherent feature combinations when initialized appropriately. These results suggest that network sparsity in MoEs may enable more interpretable models without sacrificing performance, challenging the common assumption that interpretability and capability are fundamentally at odds.
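
The active-to-total expert ratio can be checked by hand for the configurations listed in the Figure 3 caption. The sketch below is illustrative only: the `Config` class and its method names are hypothetical, and it assumes that every model uses $n = 100$ features and that each expert holds a single $n \times m$ weight matrix, with router and bias parameters left uncounted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """One architecture from the Figure 3 caption (names are illustrative)."""
    label: str
    n_features: int  # n: input features (assumed to be 100 for every model)
    hidden: int      # m: hidden dimensions per expert
    experts: int     # E: total experts (1 for the dense model)
    top_k: int       # k: experts active per input

    def active_ratio(self) -> float:
        # Ratio of active to total experts, k / E.
        return self.top_k / self.experts

    def expert_params(self) -> int:
        # Assumption: one n x m weight matrix per expert; router and
        # bias parameters are not counted.
        return self.experts * self.n_features * self.hidden


CONFIGS = [
    Config("dense", 100, 20, 1, 1),
    Config("4 experts", 100, 5, 4, 1),
    Config("10 experts", 100, 2, 10, 2),
    Config("20 experts", 100, 1, 20, 5),
]

for cfg in CONFIGS:
    print(f"{cfg.label:>10}: k/E = {cfg.active_ratio():.2f}, "
          f"expert weights = {cfg.expert_params()}")
```

Under these assumptions every configuration has 2000 expert weights, and the three MoE ratios are 0.25, 0.20, and 0.25, which is consistent with the caption's statement that the models have equal total parameters and similar $k/E$.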

Paper Structure

This paper contains 16 sections, 4 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: Feature representation and superposition in a dense model with $n = 20$ features and $m = 6$ hidden dimensions, with importance $I_i = 0.7^i$ and uniform feature density $1-S = 0.1$. Superposition (color) is given by $\sum_{j} (\hat{W}_i \cdot W_j)^2$.
  • Figure 2: Feature representation and superposition in an MoE with $n = 20$ features, $3$ total experts, and $m = 2$ hidden dimensions per expert (top-$k = 1$ routing), with importance $I_i = 0.7^i$ and feature density $1-S = 0.1$.
  • Figure 3: Features per dimension versus inverse feature density ($\tfrac{1}{1-S}$) for dense and MoE architectures with uniform feature importance ($I_i = 1.0$). The dense model ($n=100$, $m=20$) has the most superposition, which decreases with increasing expert count: $4$ experts with $m=5$, $k=1$ (orange); $10$ experts with $m=2$, $k=2$ (green); $20$ experts with $m=1$, $k=5$ (red). All models have equal total parameters and similar $k/E$. The dashed line at $1.0$ marks monosemantic representation.
  • Figure 4: For a particular expert and input dimension (feature), we can decode how the feature is embedded in the hidden dimension: ignored (white), monosemantic (blue-purple), or in superposition (red). We plot the joint feature norm ($||W_{n}||^2$) and superposition score ($\sum_{j < n} (\hat{W}_{n} \cdot W_j)^2$) of the last feature across varying feature sparsity $S \in [0.1, 1]$ and relative last-feature importance $I_n \in [0.1, 3]$, where the subscript $_{n}$ denotes the last of $n$ total features. A low L2 norm ($||W_{n}||$) is white, denoting that the model ignores the last feature; otherwise, a low superposition score is blue-purple, indicating a monosemantic representation, and red indicates that the feature is represented in superposition. For each cell, we train ten models and select the one with the lowest loss. We used a load balancing loss in this section. Cell $(i,j)$ in subfigure X.e/E denotes expert e of E total experts trained on architecture X for last-feature importance $I_n=i$ and sparsity $S=j$; X.1/1 indicates a dense model.
  • Figure 5: Expert feature norms $||W_i^{(e)}||$ and superposition (color) results for three different initialization schemes, with $n=20, m=5, E=4, S=0.1$. In (a), the gate matrix is initialized along the main diagonal ($W^r_i = \hat{e}_i$, the basis vector for that dimension), and relative feature importance decreases exponentially in order from feature 1 to feature 20. In (b), the gate matrix is initialized to an "ordered k-hot", such that the first expert aligns with the first five features and each subsequent expert aligns with the next five features. Relative feature importance is the same as in (a). In (c), the gate matrix is initialized to a "random k-hot", where each expert is assigned five random features such that experts share no common feature but collectively cover all 20 features. Relative feature importance decreases exponentially but is randomly distributed across features.
  • ...and 3 more figures
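
The superposition score used for coloring in the Figure 1 and Figure 4 captions, $\sum_{j} (\hat{W}_i \cdot W_j)^2$, can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function name `superposition_scores` is hypothetical, the weights are stored as one row per feature, the sum is assumed to skip $j = i$ (the Figure 1 caption does not say so explicitly), and zero-norm features are assigned a score of zero by convention.

```python
import math


def superposition_scores(W):
    """Return one score per feature, where W[i] is feature i's embedding.

    score_i = sum over j != i of ((W_i / ||W_i||) . W_j)^2
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    scores = []
    for i, w_i in enumerate(W):
        norm = math.sqrt(dot(w_i, w_i))
        if norm == 0.0:
            # Assumption: a zero-norm (ignored) feature has no direction,
            # so its interference is reported as zero.
            scores.append(0.0)
            continue
        unit = [x / norm for x in w_i]
        # Assumption: the feature's projection onto itself is excluded.
        scores.append(sum(dot(unit, w_j) ** 2
                          for j, w_j in enumerate(W) if j != i))
    return scores


# Hypothetical 2-dimensional weights, not data from the paper:
# features 0 and 1 are orthogonal to each other, feature 2 overlaps
# with both of them, and feature 3 has zero norm.
W = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
    [0.0, 0.0],
]
print(superposition_scores(W))
```

In this toy example, feature 2 receives the highest score because it projects onto both other nonzero features, while feature 3 corresponds to the "ignored" case that the Figure 4 caption identifies by its low norm rather than by its superposition score.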