Unveiling Super Experts in Mixture-of-Experts Large Language Models

Zunhai Su, Qingyuan Li, Hao Zhang, Weihao Ye, Qibo Xue, YuLei Qian, Yuchen Xie, Ngai Wong, Kehong Yuan

TL;DR

This work reveals a previously underappreciated phenomenon in mixture-of-experts LLMs: a vanishingly small set of Super Experts (SEs) that drive massive activations and are essential for maintaining model performance, especially in reasoning tasks. Through systematic discovery, profiling, and cross-model analyses, SEs are shown to be model-specific, data-agnostic in distribution, and robust to post-training changes. Pruning SEs triggers dramatic degradation across non-reasoning and reasoning benchmarks, including near-zero Pass@1 on Math and GPQA tasks, and even causes repetitive outputs, highlighting SEs as a bottleneck in current MoE LLMs. The authors connect SEs to the origin of systematic outliers and attention sinks in Transformers, demonstrating that compressing SEs disrupts these dynamics and offering mechanistic insights to guide future expert-aware compression and training strategies.

Abstract

In this study, we report, for the first time, the discovery and systematic investigation of a distinct subset of experts that play a pivotal role in the forward inference of MoE LLMs. These experts are prevalent in open-source MoE LLMs, and despite their extremely limited number, pruning them results in a substantial decline in model performance (e.g., pruning just three out of 6,144 experts causes Qwen3-30B-A3B to generate repetitive and uninformative outputs). We refer to these experts as Super Experts (SEs). Our comprehensive analysis provides progressively deeper insights into SEs: (i) SEs are characterized by rare but extreme activation outliers in the output of the down_proj, which give rise to massive activations in the hidden states between decoder layers. Moreover, the distribution of SEs is model-specific, data-agnostic, and remains unaffected by post-training processes. (ii) By pruning SEs, we assess their significance across a variety of tasks, revealing their considerable impact on the model's overall performance, particularly in mathematical reasoning. (iii) We further investigate why compressing SEs exerts such a pronounced impact. We show that, in MoE LLMs, SEs serve as the primary source of the systematic outlier mechanism in Transformers, and that compressing them profoundly disrupts this process, ultimately causing the collapse of attention sinks. These findings advance the understanding of the internal dynamics of MoE LLMs, filling an important gap in the current knowledge. The code is available at https://github.com/ZunhaiSu/Super-Experts-Profilling.
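The abstract characterizes SEs by rare but extreme outliers in the down_proj output. A minimal sketch of how such outliers could be flagged from profiled statistics is given below. It is not the authors' released code: the function name `find_super_experts`, the two thresholds (a percentile cut and a fraction of the global maximum), the 48 x 128 grid (chosen only so that the total is 6,144 experts), and all magnitudes are assumptions made for illustration.

```python
import numpy as np

def find_super_experts(max_abs, percentile=99.5, frac_of_max=0.1):
    """Flag (layer, expert) pairs whose peak |down_proj output| is an extreme outlier.

    max_abs: array of shape (num_layers, num_experts); entry [l, e] is the largest
    absolute down_proj output observed for expert e in layer l.
    Both thresholds are illustrative assumptions, not values taken from the paper.
    """
    max_abs = np.asarray(max_abs, dtype=np.float64)
    pct_cut = np.percentile(max_abs, percentile)   # must exceed almost all experts
    max_cut = frac_of_max * max_abs.max()          # and be comparable to the global peak
    mask = (max_abs > pct_cut) & (max_abs > max_cut)
    return [(int(l), int(e)) for l, e in np.argwhere(mask)]

# Synthetic stand-in for profiled statistics (assumed 48 layers x 128 experts = 6,144).
rng = np.random.default_rng(0)
stats = rng.uniform(0.1, 5.0, size=(48, 128))

# Plant three outliers at hypothetical positions with made-up magnitudes.
stats[1, 68] = 800.0
stats[2, 92] = 1500.0
stats[3, 82] = 3000.0

super_experts = find_super_experts(stats)
print(super_experts)  # [(1, 68), (2, 92), (3, 82)]
```

Combining a percentile cut with a fraction-of-maximum cut keeps the rule from flagging the ordinary top 0.5% of experts when no expert is truly extreme; only entries that also approach the global peak are returned.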

Paper Structure

This paper contains 26 sections, 9 equations, 20 figures, 14 tables, 1 algorithm.

Figures (20)

  • Figure 1: Analysis of expert pruning on Qwen3-30B-A3B using the WikiText-2 dataset. Pruning three Super Experts results in a significant degradation in Perplexity (PPL).
  • Figure 2: Decoder Architecture of MoE LLM.
  • Figure 3: SEs mechanism in Qwen3-30B-A3B. The line plots depict the maximum output magnitudes of down_proj for experts 68/92/82 across layers. Massive activation is gradually amplified through expert 68 in layer 1, expert 92 in layer 2, and expert 82 in layer 3. Extreme activation outliers from these SEs are propagated into the hidden states between decoders via residual summation, progressively leading to massive activation.
  • Figure 4: Impact of SE pruning on massive activations (MAs) in Qwen3-30B-A3B. MAs are computed using 100 input samples from the C4 dataset (Raffel et al., 2020), each with a length of 2K.
  • Figure 5: Heatmap visualizations of the maximum output magnitudes from the down_proj for each expert across layers. SEs are highlighted with arrows.
  • ...and 15 more figures
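
The captions of Figures 3 and 4 describe SE outliers entering the hidden states through residual summation, and massive activations changing once the SEs are pruned. The toy sketch below illustrates that mechanism only. The helper `run_residual_stream`, the 4-dimensional hidden state, the expert outputs, and the omission of routing weights and attention are all simplifying assumptions, not the model's actual computation.

```python
import numpy as np

def run_residual_stream(hidden, layer_outputs, pruned=frozenset()):
    """Add each layer's expert outputs to the hidden state and track its peak magnitude.

    layer_outputs: list (one item per layer) of dicts mapping expert id -> output vector.
    pruned: set of (layer, expert) pairs whose output is dropped (hypothetical pruning).
    Returns the maximum |hidden| value after every layer.
    """
    hidden = np.array(hidden, dtype=np.float64)
    peaks = []
    for layer, outputs in enumerate(layer_outputs):
        for expert, out in outputs.items():
            if (layer, expert) in pruned:
                continue  # a pruned expert contributes nothing to the residual sum
            hidden = hidden + np.asarray(out, dtype=np.float64)
        peaks.append(float(np.abs(hidden).max()))
    return peaks

# Made-up expert outputs: ordinary experts are small, three experts spike in one channel.
layer_outputs = [
    {5: [0.1, 0.1, 0.1, 0.1]},
    {68: [0.0, 0.0, 50.0, 0.0], 3: [0.2, 0.2, 0.2, 0.2]},
    {92: [0.0, 0.0, 200.0, 0.0]},
    {82: [0.0, 0.0, 750.0, 0.0]},
]
start = np.zeros(4)

peaks = run_residual_stream(start, layer_outputs)
pruned_peaks = run_residual_stream(
    start, layer_outputs, pruned={(1, 68), (2, 92), (3, 82)}
)
print(peaks)         # peak grows layer by layer in the spiking channel
print(pruned_peaks)  # peak stays small once the three experts are dropped
```

Because the residual stream only adds, a spike written by one expert persists into every later layer unless something cancels it, which is why removing a handful of experts in this toy setup removes the large values entirely.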