Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs

Enshu Liu, Junyi Zhu, Zinan Lin, Xuefei Ning, Matthew B. Blaschko, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang

TL;DR

This work tackles the deployment bottlenecks of sparse Mixture-of-Experts (SMoE) language models by introducing EEP, a gradient-free evolutionary pruning strategy that reduces both the total number of experts and the number of active experts. EEP operates in two phases, expert pruning and expert merging, using two matrices found by evolutionary search: one remaps the router outputs and the other merges expert parameters. Pruning up to 75% of the experts in Mixtral $8\times7$B-Instruct substantially reduces the parameter count with minimal performance loss, and pruning half of the experts raises SQuAD accuracy from 53.4% to 75.4%, showing that fewer experts can sometimes yield better task-specific results without fine-tuning. These results suggest a practical, inference-only pathway to deploying SMoE LLMs more broadly, with generalization across models and datasets, at the cost of an exploratory search process.

Abstract

The rapid advancement of large language models (LLMs) has led to architectures with billions to trillions of parameters, posing significant deployment challenges due to their substantial demands on memory, processing power, and energy consumption. Sparse Mixture-of-Experts (SMoE) architectures have emerged as a solution, activating only a subset of parameters per token, thereby achieving faster inference while maintaining performance. However, SMoE models still face limitations in broader deployment due to their large parameter counts and significant GPU memory requirements. In this work, we introduce a gradient-free evolutionary strategy named EEP (Efficient Expert Pruning) to enhance the pruning of experts in SMoE models. EEP relies solely on model inference (i.e., no gradient computation) and achieves greater sparsity while maintaining or even improving performance on downstream tasks. EEP can be used to reduce both the total number of experts (thus saving GPU memory) and the number of active experts (thus accelerating inference). For example, we demonstrate that pruning up to 75% of experts in Mixtral $8\times7$B-Instruct results in a substantial reduction in parameters with minimal performance loss. Remarkably, we observe improved performance on certain tasks, such as a significant increase in accuracy on the SQuAD dataset (from 53.4% to 75.4%), when pruning half of the experts. With these results, EEP not only lowers the barrier to deploying SMoE models, but also challenges the conventional understanding of model pruning by showing that fewer experts can lead to better task-specific performance without any fine-tuning. Code is available at https://github.com/imagination-research/EEP.
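
To make "gradient-free evolutionary strategy" concrete, the sketch below shows a generic elitist evolutionary search that only ever calls a scoring function and never computes a gradient. It is not the authors' algorithm: the function name `evolutionary_search`, the Gaussian mutation, the population size, and the toy objective `toy_fitness` are all assumptions chosen for illustration. In an EEP-like setting, the score would instead come from running model inference on task data with a candidate pruning configuration.

```python
import random

def evolutionary_search(fitness, init, iters=100, pop=8, sigma=0.2, seed=0):
    """Generic elitist evolutionary search (illustrative, not the paper's method).

    Only `fitness(candidate)` is ever called, so no gradients are required.
    """
    rng = random.Random(seed)
    parent, parent_score = list(init), fitness(init)
    for _ in range(iters):
        # Mutate the current parent with Gaussian noise to create a population.
        children = [[x + rng.gauss(0.0, sigma) for x in parent] for _ in range(pop)]
        # Score each child exactly once (in practice, one inference pass each).
        scored = [(fitness(child), child) for child in children]
        top_score, top = max(scored, key=lambda pair: pair[0])
        # Elitism: replace the parent only if the best child is strictly better.
        if top_score > parent_score:
            parent, parent_score = top, top_score
    return parent, parent_score

# Hypothetical stand-in for task accuracy: closeness to a hidden target vector.
TARGET = [0.5, -1.0, 2.0, 0.0]

def toy_fitness(candidate):
    return -sum((c - t) ** 2 for c, t in zip(candidate, TARGET))

init = [0.0] * len(TARGET)
best, best_score = evolutionary_search(toy_fitness, init)
print(best_score >= toy_fitness(init))  # elitism guarantees the score never decreases
```

Because the parent is replaced only by a strictly better child, the score is non-decreasing over iterations, which is the property that makes such a search usable when each evaluation is a forward pass rather than a differentiable loss.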

Paper Structure

This paper contains 32 sections, 6 equations, 11 figures, and 14 tables.

Figures (11)

  • Figure 1: (a) the original SMoE block and (b) our implementation of EEP. We introduce the expert merging matrix $\bm W_{\text{EM}}$, and the router mapping matrix $\bm W_{\text{RM}}$, to enable the search for the optimal pruning configuration. When $\bm W_{\text{EM}}$ and $\bm W_{\text{RM}}$ have one-hot vectors as their rows, pruning is performed. When their elements are continuous values, routing weights and experts are aggregated to generate new weights and experts. We use an evolutionary strategy to search for the optimal $\bm W_{\text{EM}}$ and $\bm W_{\text{RM}}$.
  • Figure 2: We leverage EEP for two purposes: reducing the total number of experts, which lowers the memory footprint (use case 1), and reducing the number of active experts, thereby accelerating inference (use case 2).
  • Figure 3: Performance from a single expert to an ensemble of experts.
  • Figure 4: Statistics of the expert activation patterns before and after the Expert Pruning Phase. The data represents the first transformer block of Mixtral $8\times 7$B-Instruct on the SQuAD dataset. In (a), four retained experts are re-indexed from 0 to 3 for clarity.
  • Figure 5: Accuracy-Iteration curves on different datasets. The model is Mixtral $8\times 7$B and the total number of experts is 4.
  • ...and 6 more figures
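
The role of the two matrices described in the Figure 1 caption can be illustrated with a small NumPy sketch. The shapes used here (M retained experts by N original experts), the representation of each expert as a flattened parameter vector, the helper name `apply_matrices`, and the row normalisation in the merging case are assumptions made for illustration; they are not taken from the paper or its code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumed): N original experts, M retained experts,
# d = hidden size seen by the router, p = flattened parameters per expert.
N, M, d, p = 8, 4, 16, 32

router = rng.normal(size=(N, d))   # one router weight row per original expert
experts = rng.normal(size=(N, p))  # one flattened parameter row per original expert

def apply_matrices(W_RM, W_EM, router, experts):
    """Map N router rows and N experts to M of each by left-multiplication."""
    return W_RM @ router, W_EM @ experts

# Case 1: one-hot rows. Each retained slot copies exactly one original expert,
# so the operation reduces to plain pruning.
keep = [0, 2, 5, 7]
W_onehot = np.eye(N)[keep]  # shape (M, N), one 1 per row
pruned_router, pruned_experts = apply_matrices(W_onehot, W_onehot, router, experts)
assert np.allclose(pruned_router, router[keep])
assert np.allclose(pruned_experts, experts[keep])

# Case 2: continuous rows. Each retained slot is a weighted aggregate of
# the original routing weights and experts (merging).
W_cont = rng.random(size=(M, N))
W_cont /= W_cont.sum(axis=1, keepdims=True)  # normalisation is an assumption
merged_router, merged_experts = apply_matrices(W_cont, W_cont, router, experts)
print(merged_router.shape, merged_experts.shape)
```

Writing pruning and merging as the same matrix product is what lets a single search procedure cover both cases: a search over one-hot rows explores which experts to keep, and relaxing the rows to continuous values explores how to combine them.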