The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level

Jeremy Herbst, Jae Hee Lee, Stefan Wermter

Abstract

Mixture-of-Experts (MoE) architectures have become the dominant choice for scaling Large Language Models (LLMs), activating only a subset of parameters per token. While MoE architectures are primarily adopted for computational efficiency, it remains an open question whether their sparsity makes them inherently easier to interpret than dense feed-forward networks (FFNs). We compare MoE experts and dense FFNs using $k$-sparse probing and find that expert neurons are consistently less polysemantic, with the gap widening as routing becomes sparser. This suggests that sparsity pressures both individual neurons and entire experts toward monosemanticity. Leveraging this finding, we zoom out from the neuron to the expert level as a more effective unit of analysis. We validate this approach by automatically interpreting hundreds of experts. This analysis allows us to resolve the debate on specialization: experts are neither broad domain specialists (e.g., biology) nor simple token-level processors. Instead, they function as fine-grained task experts, specializing in linguistic operations or semantic tasks (e.g., closing brackets in LaTeX). Our findings suggest that MoEs are inherently interpretable at the expert level, providing a clearer path toward large-scale model interpretability. Code is available at: https://github.com/jerryy33/MoE_analysis
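
To make the probing methodology mentioned above concrete, below is a minimal sketch of $k$-sparse probing in Python. It is not the authors' released implementation: the mean-difference neuron-selection heuristic, the logistic-regression probe, and all variable names are illustrative assumptions. The idea is to restrict a linear probe to the $k$ most concept-relevant neurons of a layer (dense FFN or MoE expert); if a concept can be recovered with high F1 from only a few neurons, those neurons are closer to monosemantic, whereas a concept that needs many neurons indicates polysemantic, distributed encoding.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score


def k_sparse_probe_f1(acts_train, y_train, acts_test, y_test, k=16):
    """Train a linear probe on only the k most informative neurons.

    acts_*: (n_tokens, n_neurons) activations from one layer's FFN or expert.
    y_*:    binary per-token concept labels (e.g., "token is inside LaTeX math").
    Returns the held-out F1 score of the k-sparse probe.
    """
    # Illustrative selection heuristic: rank neurons by the absolute difference
    # of their class-conditional mean activations, then keep the top k.
    mean_diff = np.abs(
        acts_train[y_train == 1].mean(axis=0) - acts_train[y_train == 0].mean(axis=0)
    )
    top_k = np.argsort(mean_diff)[-k:]

    # Fit a simple logistic-regression probe on the selected neurons only.
    probe = LogisticRegression(max_iter=1000)
    probe.fit(acts_train[:, top_k], y_train)

    return f1_score(y_test, probe.predict(acts_test[:, top_k]))
```

Sweeping this routine over increasing values of $k$ and reporting the best layer per model yields F1-versus-$k$ curves of the kind the paper uses to compare MoE experts against matched dense FFNs.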
