MoNE: Replacing Redundant Experts with Lightweight Novices for Structured Pruning of MoE

Geng Zhang, Yuxuan Han, Yuxuan Lou, Yiqi Zhang, Wangbo Zhao, Yang You

TL;DR

MoNE tackles the memory overhead of deploying large MoE models by pruning redundant experts and replacing them with lightweight novices. It uses two redundancy metrics, expert access frequency and output variance, to identify pruning candidates, and builds each novice from the mean output of the expert it replaces, an unbiased estimate that helps preserve performance. Across multiple MoE architectures, calibration data sources, and calibration sample sizes, MoNE consistently outperforms baseline structured pruning methods: it achieves up to 2.72 points higher average zero-shot accuracy at a 25% pruning ratio, with only a 0.14-point drop for Qwen2-57B-A14B. The results indicate that MoNE is robust and scalable, with practical implications for memory-efficient deployment of large MoE-based systems.

Abstract

Mixture-of-Experts (MoE) enables efficient scaling of large language models by activating only a subset of experts per input token. However, deploying MoE-based models incurs significant memory overhead because all experts must be retained in memory. While structured pruning is a promising way to reduce memory costs, existing methods often show suboptimal performance and unstable degradation along three dimensions: model architectures, calibration data sources, and calibration sample sizes. This paper proposes Mixture-of-Novices-and-Experts (MoNE), a novel expert pruning method that replaces redundant experts with lightweight novices to achieve effective and robust model compression. MoNE evaluates expert redundancy based on two metrics: access frequency and output variance. Experts exhibiting low usage and stable outputs are pruned and replaced with lightweight novices (unbiased estimates of their original outputs), minimizing performance degradation. Extensive experiments demonstrate that MoNE consistently outperforms baseline methods with minimal accuracy degradation across the three dimensions, confirming its effectiveness and robustness. Notably, it outperforms baselines by up to 2.72 points in average zero-shot accuracy across nine downstream tasks at a 25% pruning ratio, with only a 0.14-point performance drop for Qwen2-57B-A14B. The code is available at https://github.com/zxgx/mode-pd.
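
The pruning idea described in the abstract can be illustrated with a small, self-contained sketch for a single MoE layer. This is not the authors' implementation: the function name `build_novices`, the toy data, the top-1 routing, the max-normalization, and especially the fusion rule (the product of normalized frequency and normalized variance) are all assumptions made for illustration. The paper's actual redundancy formula may differ. The only parts taken from the text are the two metrics and the use of an expert's mean calibration output as its novice.

```python
from statistics import fmean


def build_novices(routed_outputs, num_tokens, prune_ratio):
    """Pick redundant experts in one layer and return their novices.

    routed_outputs: dict expert_id -> non-empty list of output vectors
        (lists of floats) collected on calibration tokens routed to it.
    Returns dict expert_id -> mean output vector (the "novice").
    """
    ids = sorted(routed_outputs)
    freq, var, mean = {}, {}, {}
    for eid in ids:
        outs = routed_outputs[eid]
        dim = len(outs[0])
        # Access frequency: share of calibration tokens sent to this expert.
        freq[eid] = len(outs) / num_tokens
        # Sample mean of the outputs; this constant vector is the novice.
        mean[eid] = [fmean(o[d] for o in outs) for d in range(dim)]
        # Output variance: mean squared deviation from the mean output.
        var[eid] = fmean(
            fmean((o[d] - mean[eid][d]) ** 2 for d in range(dim)) for o in outs
        )

    def normalized(metric):
        # Layer-wise normalization by the maximum value (assumed scheme).
        top = max(metric.values())
        return {e: (v / top if top > 0 else 0.0) for e, v in metric.items()}

    nf, nv = normalized(freq), normalized(var)
    # ASSUMED fusion rule: a low product means rarely used AND stable output.
    score = {e: nf[e] * nv[e] for e in ids}
    n_prune = int(len(ids) * prune_ratio)
    pruned = sorted(ids, key=lambda e: score[e])[:n_prune]
    return {e: mean[e] for e in pruned}


# Hypothetical calibration data: 10 tokens, 4 experts, top-1 routing.
routed = {
    0: [[1.0, 0.0], [1.0, 0.0]],                            # rare, stable
    1: [[2.0, 1.0], [-2.0, 3.0], [0.0, -4.0], [4.0, 0.0]],  # frequent, variable
    2: [[0.5, 0.5], [1.5, -0.5]],                           # rare, mildly variable
    3: [[3.0, 3.0], [5.0, 1.0]],                            # rare, more variable
}
novices = build_novices(routed, num_tokens=10, prune_ratio=0.25)
print(novices)  # {0: [1.0, 0.0]}
```

In this toy layer, expert 0 is both rarely accessed and perfectly stable, so it is the one replaced at a 25% pruning ratio. At inference time a pruned expert would return its stored mean vector instead of running its feed-forward weights, which is where the memory saving would come from under these assumptions.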

Paper Structure

This paper contains 28 sections, 7 equations, 13 figures, 16 tables.

Figures (13)

  • Figure 1: (a) Different structured pruning methods. (b) Layer-wise normalized expert access frequency and output variance of Deepseek-V2-Lite for three downstream tasks. The experts with high access frequency or high output variance are the same across downstream tasks. Experts in blue circles have both high frequency and high variance. Experts in red circles have only high variance. Experts in green circles have only high frequency. Similar observations on other models and tasks are in the appendix on redundant experts.
  • Figure 2: Overview of MoNE. Given an MoE model, MoNE first uses a calibration dataset to evaluate expert access frequency and output variance. The two metrics are then fused to obtain the expert redundancy. Finally, novices are derived from the averaged outputs of the redundant experts.
  • Figure 3: Average accuracy versus accuracy drop variance. MoNE advances the Pareto frontier across varying model architectures, calibration data sources and calibration sample sizes.
  • Figure 4: Ablation study on expert access frequency, output variance, and novice replacement. Numbers show the difference relative to the proposed MoNE. Detailed results are provided in the appendix on ablation study details.
  • Figure 5: Layer-wise normalized expert access frequency and output variance of OLMoE for Arc-C & Arc-E, MMLU and Winogrande.
  • ...and 8 more figures