AIMER: Calibration-Free Task-Agnostic MoE Pruning

Zongfang Liu, Shengkun Tang, Yifan Shen, Huan Wang, Xin Yuan

Abstract

Mixture-of-Experts (MoE) language models increase parameter capacity without a proportional increase in per-token compute, but deployment still requires storing all experts, which makes expert pruning important for reducing memory and serving overhead. Existing task-agnostic expert pruning methods are typically calibration-dependent: they estimate expert importance from routing or activation statistics on a calibration set, which makes pruning outcomes sensitive to the choice of calibration set and adds substantial preprocessing cost. We introduce AIMER (Absolute mean over root mean square IMportance for Expert Ranking), a simple calibration-free criterion that yields clear within-layer score separation and distinct expert stratification. Across 7B to 30B MoE language models, at 25% and 50% pruning ratios and over 16 benchmarks, AIMER consistently delivers competitive or stronger overall performance against state-of-the-art calibration-based expert pruning baselines, while scoring the experts takes only 0.22 to 1.27 seconds.
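As a rough illustration of the statistic named in the abstract, the sketch below computes the ratio of the mean absolute value to the root mean square over an expert's weights. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `abs_mean_over_rms`, the choice to pool all of an expert's matrices into one flat vector, and the zero-weight convention are all hypothetical, and the abstract does not say whether high- or low-scoring experts are the ones pruned.

```python
import numpy as np


def abs_mean_over_rms(weights):
    """Return mean(|w|) / sqrt(mean(w^2)) over all entries of `weights`.

    `weights` is a list of arrays (e.g. the projection matrices of one expert).
    Pooling them into a single flat vector is an assumption of this sketch.
    """
    flat = np.concatenate(
        [np.asarray(w, dtype=np.float64).ravel() for w in weights]
    )
    rms = np.sqrt(np.mean(flat ** 2))
    if rms == 0.0:
        # Convention chosen here for an all-zero expert; not from the paper.
        return 0.0
    return float(np.mean(np.abs(flat)) / rms)


# Hypothetical usage: score every expert in one layer from its weights alone.
rng = np.random.default_rng(0)
layer = [[rng.normal(size=(8, 4)), rng.normal(size=(4, 8))] for _ in range(4)]
scores = [abs_mean_over_rms(expert) for expert in layer]
```

The ratio needs no calibration data and is invariant to rescaling the weights by a constant. By the Cauchy-Schwarz inequality it never exceeds 1, and it equals 1 only when every entry has the same magnitude.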

Paper Structure (27 sections, 13 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Sensitivity of REAP [lasby2025reap] to calibration set size on Qwen3-30B at 50% pruning ratio. We fix the calibration corpus to C4 [allenai_c4_2024] and vary only the size of the calibration set from 0.5M to 2.1M tokens. The $x$-axis reports calibration tokens in millions (M = million tokens), and the $y$-axis reports performance change relative to the 0.5M-token setting (pp = percentage points). Half of the benchmarks show significant variation. Some benchmarks improve while others degrade, showing that performance is highly sensitive to calibration set size even with the same corpus.
  • Figure 2: Layer-wise Magnitude and AIMER score profiles across three MoE models. Columns show OLMoE-7B, ERNIE-21B, and Qwen3-30B. The top row uses Magnitude, and the bottom row uses AIMER. Within each layer, experts are ranked by the corresponding score, and scores are min-max rescaled to $[0,1]$. The $x$-axis reports within-layer expert rank, the $y$-axis reports layer index, and color indicates the rescaled score. Compared with Magnitude, AIMER yields a more separable distribution over experts, making the differences more distinguishable.
  • Figure 3: Radar plot of Qwen3-30B performance across all benchmarks at 50% pruning ratio. The dashed outline denotes the dense model, and each colored trace corresponds to one pruning method. Higher values indicate better task performance on the corresponding benchmark. Among the pruned models, AIMER encloses the largest area and has a capability profile closer to that of the full model. Additional radar plots for the other settings are provided in Appendix \ref{sec:appendix_radar}.
  • Figure 4: Radar plots of ERNIE-21B performance across all benchmarks. The left and right panels show 25% and 50% pruning ratios, respectively.
  • Figure 5: Radar plots of Qwen3-30B and OLMoE-7B performance across all benchmarks at 25% pruning. The left and right panels show Qwen3-30B and OLMoE-7B, respectively.
  • ...and 1 more figure
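The preprocessing described in the Figure 2 caption (rank experts within each layer by score, then min-max rescale the scores to $[0,1]$) can be sketched as follows. This is a minimal sketch, assuming the per-expert scores are already available as a layers-by-experts array; the function name `rank_and_rescale`, the descending rank order, and the handling of layers with identical scores are assumptions not stated in the caption.

```python
import numpy as np


def rank_and_rescale(scores):
    """Sort each layer's expert scores and min-max rescale them to [0, 1].

    `scores` has shape (num_layers, num_experts). Column j of the result
    holds the rescaled score of the rank-j expert in each layer.
    """
    scores = np.asarray(scores, dtype=np.float64)
    # Descending order within each layer (assumed; the caption only says "ranked").
    ordered = np.sort(scores, axis=1)[:, ::-1]
    lo = ordered.min(axis=1, keepdims=True)
    hi = ordered.max(axis=1, keepdims=True)
    # Avoid division by zero when all experts in a layer share one score.
    span = np.where(hi > lo, hi - lo, 1.0)
    return (ordered - lo) / span


# Hypothetical usage: two layers with three experts each.
profile = rank_and_rescale([[2.0, 4.0, 3.0], [5.0, 5.0, 5.0]])
```

Rendering the returned array as a heatmap, with columns as within-layer rank and rows as layer index, gives the kind of profile the caption describes. Because each layer is rescaled independently, the colors are comparable within a layer but not across layers.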