HEAPr: Hessian-based Efficient Atomic Expert Pruning in Output Space

Ke Li, Zheng Yang, Zhongbin Zhou, Feng Xue, Zhonglin Jiang, Wenxiao Wang

TL;DR

HEAPr is a novel pruning algorithm that decomposes experts into smaller, indivisible atomic experts, enabling more precise and flexible pruning at the atomic-expert level. It outperforms existing expert-level pruning methods across a wide range of pruning ratios and benchmarks.

Abstract

Mixture-of-Experts (MoE) architectures in large language models (LLMs) deliver exceptional performance and reduced inference costs compared to dense LLMs. However, their large parameter counts result in prohibitive memory requirements, limiting practical deployment. While existing pruning methods primarily focus on expert-level pruning, this coarse granularity often leads to substantial accuracy degradation. In this work, we introduce HEAPr, a novel pruning algorithm that decomposes experts into smaller, indivisible atomic experts, enabling more precise and flexible atomic expert pruning. To measure the importance of each atomic expert, we leverage second-order information based on principles similar to the Optimal Brain Surgeon theory. To address the computational and storage challenges posed by second-order information, HEAPr exploits the inherent properties of atomic experts to transform the second-order information from expert parameters into that of atomic expert parameters, and further simplifies it to the second-order information of atomic expert outputs. This approach reduces the space complexity from $O(d^4)$, where $d$ is the model's dimensionality, to $O(d^2)$. HEAPr requires only two forward passes and one backward pass on a small calibration set to compute the importance of atomic experts. Extensive experiments on MoE models, including the DeepSeek MoE and Qwen MoE families, demonstrate that HEAPr outperforms existing expert-level pruning methods across a wide range of pruning ratios and benchmarks. Specifically, HEAPr achieves nearly lossless compression at pruning ratios of 20% to 25% for most models, while also reducing FLOPs by nearly 20%. The code can be found at [https://github.com/LLIKKE/HEAPr](https://github.com/LLIKKE/HEAPr).
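
To make the idea of an atomic expert concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a SwiGLU-style gated expert with a SiLU activation and the weight layout described in the Figure 1 caption ($W_{gate}$ and $W_{up}$ of shape $d \times m$, $W_{down}$ of shape $m \times d$). The function names `expert_forward`, `atomic_forward`, and `prune_atomic`, the toy sizes, and the dropped indices are hypothetical and chosen only for illustration. The sketch shows that the expert output is the sum of its atomic experts' outputs, so removing column $t$ of $W_{gate}$ and $W_{up}$ together with row $t$ of $W_{down}$ removes exactly the $t$-th term. The Hessian-based importance score is not reproduced here.

```python
import numpy as np

def silu(x):
    # SiLU activation, assumed here as the gate nonlinearity.
    return x / (1.0 + np.exp(-x))

def expert_forward(x, W_gate, W_up, W_down):
    # Gated expert: x is (d,), W_gate and W_up are (d, m), W_down is (m, d).
    h = silu(x @ W_gate) * (x @ W_up)
    return h @ W_down

def atomic_forward(x, W_gate, W_up, W_down, t):
    # Output of the t-th atomic expert: column t of W_gate and W_up,
    # row t of W_down. The scalar activation scales the output row.
    return silu(x @ W_gate[:, t]) * (x @ W_up[:, t]) * W_down[t, :]

def prune_atomic(W_gate, W_up, W_down, drop):
    # Remove the atomic experts listed in `drop` by slicing the weights.
    keep = np.setdiff1d(np.arange(W_down.shape[0]), drop)
    return W_gate[:, keep], W_up[:, keep], W_down[keep, :]

# Toy sizes (hypothetical): hidden size d, intermediate size m.
rng = np.random.default_rng(0)
d, m = 8, 16
x = rng.standard_normal(d)
W_gate = rng.standard_normal((d, m))
W_up = rng.standard_normal((d, m))
W_down = rng.standard_normal((m, d))

full = expert_forward(x, W_gate, W_up, W_down)
atoms = np.stack([atomic_forward(x, W_gate, W_up, W_down, t) for t in range(m)])

# Prune two arbitrarily chosen atomic experts and recompute the output.
drop = np.array([3, 7])
pruned_weights = prune_atomic(W_gate, W_up, W_down, drop)
pruned = expert_forward(x, *pruned_weights)
```

In this sketch `full` matches `atoms.sum(axis=0)` up to floating-point error, and `pruned` equals `full` minus the outputs of the dropped atomic experts. This additivity is why each atomic expert can be scored and removed independently of the others in the same expert.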

Paper Structure

This paper contains 34 sections, 25 equations, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: Illustration of atomic expert-level pruning, which removes the $t$-th column from the $W_{gate}$ and $W_{up}$ matrices, and the corresponding $t$-th row from the $W_{down}$ matrix.
  • Figure 2: Performance of DeepSeekMoE-16B-Base under varying compression ratios, with corresponding FLOPs saving on WikiText2 data.
  • Figure 3: Consistency between atomic expert normalized importance score $s_k$ and the change in loss. The figure plots the actual loss increase $\Delta \ell$ observed upon pruning atomic experts within 10% quantile bins (ordered by original expert index) against the cumulative importance score $s_k$.
  • Figure 4: Performance of DeepSeekMoE-16B-Base under a 20% compression ratio, using calibration data randomly sampled from WikiText-2 and C4.
  • Figure 5: Compression ratios across different layers under 25% global pruning for Qwen1.5-MoE-A2.7B-Chat, DeepSeekMoE-16b-Base, and Qwen3-30B-A3B.
  • ...and 1 more figure