MoE Lens -- An Expert Is All You Need

Marmik Chaudhari, Idhant Gulati, Nishkal Hundia, Pranav Karra, Shivam Raval

TL;DR

We systematically analyze expert specialization in MoEs through two complementary approaches: domain-specific routing patterns and an early decoding framework that tracks expert contributions to output representations. The analysis indicates concentrated expertise, which highlights opportunities for inference optimization through targeted expert pruning while maintaining model performance, and opens avenues for studying how learned knowledge is localized in these models.

Abstract

Mixture of Experts (MoE) models enable parameter-efficient scaling through sparse expert activations, yet optimizing their inference and memory costs remains challenging because their specialization behavior is poorly understood. We present a systematic analysis of expert specialization in MoEs through two complementary approaches: domain-specific routing patterns and an early decoding framework that tracks expert contributions to output representations. Our analysis of the DeepSeekMoE model reveals that, despite having 64 routed experts with 6 active for each layer's computation, the model predominantly relies on a few specialized experts, with the top-weighted expert's output closely approximating the full ensemble prediction. We quantitatively validate these findings through a systematic analysis of the token routing distribution, demonstrating that very few experts handle over 50% of routing decisions across different specialized domains. Hidden-state similarity between the single-expert and full-ensemble outputs is high at every layer, with cosine similarity reaching 0.95 in some layers, and perplexity increases by only 5% when a single expert is used, across all three domains. Our results indicate that Mixture of Experts models exhibit concentrated expertise, which highlights opportunities for inference optimization through targeted expert pruning while maintaining model performance, and opens avenues for studying how learned knowledge is localized in these models.
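The hidden-state comparison described in the abstract can be illustrated with a minimal sketch. It assumes that a layer's hidden state is the residual stream plus a weighted sum of the selected experts' outputs, and that router weights are not renormalized when fewer experts are kept; the paper may handle both differently. The names `cosine_similarity` and `combine_top_k` and the toy vectors are hypothetical and are not taken from the paper's code.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def combine_top_k(residual, expert_outputs, weights, k):
    """Residual stream plus the weighted outputs of the k highest-weighted experts.

    Assumption: weights are used as given, without renormalizing over the kept experts.
    """
    kept = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:k]
    hidden = list(residual)
    for i in kept:
        for d in range(len(hidden)):
            hidden[d] += weights[i] * expert_outputs[i][d]
    return hidden


# Toy data (made up for illustration): two experts, three hidden dimensions.
residual = [1.0, 0.0, 0.0]
expert_outputs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
weights = [0.8, 0.2]

h_top1 = combine_top_k(residual, expert_outputs, weights, k=1)
h_full = combine_top_k(residual, expert_outputs, weights, k=2)
similarity = cosine_similarity(h_top1, h_full)
print(f"cosine similarity, top-1 vs. all experts: {similarity:.4f}")
```

In this toy case the top-weighted expert dominates the sum, so the two hidden states point in nearly the same direction. That is the kind of alignment the abstract reports, although real router weights and expert outputs need not behave this way.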

Paper Structure (11 sections, 4 equations, 11 figures)


Figures (11)

  • Figure 1: Expert Specialization in DeepSeekMoE. We visualize the distribution of tokens routed to each expert for our English, French-QA, and GSM8K datasets. The y-axis shows the routing percentage per expert, with the red dashed line indicating a uniform routing baseline ($6/64 \approx$ 9.4%). See Appendix \ref{subsec:expert-specialization} for extended plots for other models and layers.
  • Figure 2: An example of early decoding using LogitLens for DeepSeekMoE on an example input: "When datasets are sufficiently large, increasing the capacity (number of parameters) of neural networks can give much better prediction". Each cell shows the top-1 token prediction after the final token "these" across layers (rows), for the layer output and for routed experts + residual stream at various top-$k$ values. Color intensity indicates prediction confidence. The lower-left subscript denotes the expert index and the top-right superscript denotes the expert weight. See Appendix \ref{subsec:logit-lens} for other domains.
  • Figure 3: (Left) Normalized log perplexity across different values of top-$k$ experts for various domains on the next-token prediction task. (Right) Cosine similarity between the hidden states $\mathbf{H_{t}^{\ell_1}}$ and $\mathbf{H_{t}^{\ell_6}}$ across all 27 layers shows consistently high alignment.
  • Figure 4: Expert Specialization (Part 1) of DeepSeekMoE for various layers. We visualize how frequently tokens from different domains are routed to the 64 experts using top-$k=6$ routing. The y-axis shows routing percentage per expert, with the red dashed line indicating uniform routing baseline ($\approx$ 9.4%).
  • Figure 5: Expert Specialization (Part 2) of DeepSeekMoE for various layers. We visualize how frequently tokens from different domains are routed to the 64 experts using top-$k=6$ routing. The y-axis shows routing percentage per expert, with the red dashed line indicating uniform routing baseline ($\approx$ 9.4%).
  • ...and 6 more figures
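
The routing percentages and the uniform baseline quoted in the captions above can be reproduced in outline as follows. This is a sketch under assumptions: "routing percentage per expert" is read as the share of tokens for which an expert appears among the top-$k$ selections, which gives a uniform baseline of 6/64 = 9.375% (shown as ≈ 9.4%). The router scores are synthetic random numbers, and `top_k_experts`, `routing_percentages`, and `uniform_baseline` are hypothetical helper names, not functions from the paper or from DeepSeekMoE.

```python
import random
from collections import Counter

NUM_EXPERTS = 64  # routed experts per layer, as stated in the abstract
TOP_K = 6         # experts active per token, as stated in the abstract


def top_k_experts(router_scores, k=TOP_K):
    """Indices of the k highest-scoring experts for one token."""
    order = sorted(range(len(router_scores)), key=lambda e: router_scores[e], reverse=True)
    return order[:k]


def routing_percentages(all_scores, num_experts=NUM_EXPERTS, k=TOP_K):
    """Percentage of tokens for which each expert is among the top-k choices."""
    counts = Counter()
    for scores in all_scores:
        counts.update(top_k_experts(scores, k))
    n_tokens = len(all_scores)
    return [100.0 * counts[e] / n_tokens for e in range(num_experts)]


def uniform_baseline(num_experts=NUM_EXPERTS, k=TOP_K):
    """Expected routing percentage per expert if tokens were spread evenly."""
    return 100.0 * k / num_experts


# Synthetic router scores (assumption: 1000 tokens with i.i.d. Gaussian scores).
rng = random.Random(0)
scores = [[rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)] for _ in range(1000)]
pct = routing_percentages(scores)

print(uniform_baseline())  # 9.375
above = [e for e, p in enumerate(pct) if p > uniform_baseline()]
print(f"{len(above)} of {NUM_EXPERTS} experts exceed the uniform baseline")
```

Under this reading the per-expert percentages sum to 600% rather than 100%, because every token is counted once for each of its 6 selected experts. With real router outputs, specialization would show up as a handful of experts far above the dashed baseline.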