Mixture Compressor for Mixture-of-Experts LLMs Gains More

Wei Huang, Yue Liao, Jianhui Liu, Ruifei He, Haoru Tan, Shiming Zhang, Hongsheng Li, Si Liu, Xiaojuan Qi

TL;DR

This work introduces MC, a training-free Mixture-Compressor for MoE-LLMs, combining Pre-Loading Mixed-Precision Quantization (PMQ) and Online Dynamic Pruning (ODP) to aggressively compress expert parameters and dynamically select activated experts. PMQ uses an LP-based, expert-significance–aware bit-width allocation grounded in access frequency and activation weights, while ODP prunes low-confidence experts at inference with token-protection to avoid attention decay. The approach achieves extreme compression (e.g., ~76.6% of parameters) with minimal accuracy loss (as low as ~3.8%) and significant speedups (up to ~1.8x), outperforming several baselines at ultra-low bit-widths. Extensive experiments on Mixtral models demonstrate practical deployment benefits, including memory reductions and hardware-efficient inference, without requiring training.

Abstract

Mixture-of-Experts large language models (MoE-LLMs) mark a significant step forward for language models. However, they encounter two critical challenges in practice: 1) expert parameters lead to considerable memory consumption and loading latency; and 2) the currently activated experts are redundant, as many tokens may only require a single expert. Motivated by these issues, we investigate MoE-LLMs and make two key observations: a) different experts exhibit varying behaviors in activation reconstruction error, routing scores, and activation frequencies, highlighting their differing importance, and b) not all tokens are equally important; only a small subset is critical. Building on these insights, we propose MC, a training-free Mixture-Compressor for MoE-LLMs, which leverages the significance of both experts and tokens to achieve extreme compression. First, to mitigate storage and loading overheads, we introduce Pre-Loading Mixed-Precision Quantization, which formulates adaptive bit-width allocation as a Linear Programming problem whose objective function balances multiple factors reflecting the importance of each expert. Additionally, we develop Online Dynamic Pruning, which identifies important tokens to retain and dynamically selects the activated experts for the other tokens during inference, optimizing efficiency while maintaining performance. Our MC integrates static quantization and dynamic pruning to collaboratively achieve extreme compression for MoE-LLMs with less accuracy loss, ensuring an optimal trade-off between performance and efficiency. Extensive experiments confirm the effectiveness of our approach. For instance, at 2.54 bits, MC compresses 76.6% of the model, with only a 3.8% average accuracy loss. During dynamic inference, we further reduce activated parameters by 15%, with a performance drop of less than 0.6%.
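To make the bit-width allocation idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each expert's importance is the product of its activation frequency and mean routing weight, that a per-expert quantization error is known for each candidate bit-width, and that the budget is a fixed total number of bits across experts. The function name `allocate_bits`, the candidate bit-widths, and all numbers are hypothetical. A small dynamic program stands in for the LP solver so the example needs only the standard library.

```python
from typing import Dict, List, Sequence


def allocate_bits(importance: Sequence[float],
                  quant_error: Sequence[Dict[int, float]],
                  total_bits: int) -> List[int]:
    """Choose one bit-width per expert so that the bit-widths sum to
    `total_bits` and sum(importance[i] * error[i][bits_i]) is minimal.

    Hypothetical stand-in for the paper's LP: an exact dynamic program
    over (experts processed, bits used so far).
    """
    # best[t] = (cost, choices) for the experts seen so far using exactly t bits
    best = {0: (0.0, [])}
    for imp, errors in zip(importance, quant_error):
        nxt = {}
        for used, (cost, choice) in best.items():
            for bits, err in errors.items():
                t = used + bits
                if t > total_bits:
                    continue  # over budget, discard this branch
                c = cost + imp * err
                if t not in nxt or c < nxt[t][0]:
                    nxt[t] = (c, choice + [bits])
        best = nxt
    if total_bits not in best:
        raise ValueError("bit budget cannot be met with the given bit-widths")
    return best[total_bits][1]


# Illustrative (made-up) statistics for four experts in one MoE layer.
freq = [0.40, 0.30, 0.20, 0.10]          # assumed activation frequencies
mean_weight = [0.6, 0.5, 0.5, 0.4]       # assumed mean routing weights
importance = [f * w for f, w in zip(freq, mean_weight)]
error_by_bits = {1: 9.0, 2: 3.0, 3: 1.0}  # assumed error per bit-width
quant_error = [error_by_bits] * 4

# Average of 2 bits per expert -> total budget of 8 bits.
print(allocate_bits(importance, quant_error, total_bits=8))
```

With these made-up numbers, the most important expert receives 3 bits and the least important receives 1 bit, while the average stays at 2 bits. The actual objective and constraints used by PMQ may differ from this simplified form.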

Paper Structure

This paper contains 24 sections, 10 equations, 12 figures, and 12 tables.

Figures (12)

  • Figure 1: (a) MMLU (5-shot$\uparrow$) accuracy across different open-source LLMs with various activated parameter sizes (dotted lines denote quantized models; solid lines denote 16-bit models). To align the parameter size of quantized models with that of 16-bit models, we count 16 bits as one parameter (e.g., 8$\times$2-bit elements represent one parameter). (b) Comparison of total parameter size and inference-time activated parameter size for several open-source LLMs and the compressed Mixtral 8$\times$7b.
  • Figure 2: Overview of our proposed MC pipeline with two-stage compression of experts. (a) Framework of pre-loading static mixed-precision quantization (PMQ) of MoE-LLMs. PMQ determines the activated feature and loss sensitivity of all experts and plans the optimal precision configuration under an ultra-low bit-width. (b) Schematic of online dynamic mixture pruning (ODP) of MoE-LLMs. ODP combines a significant-token protection mechanism with weight-guided expert pruning, and needs to protect only 2% of tokens to safeguard MoE performance.
  • Figure 3: Distribution of expert drop F-norm (red), activated weights (green), and frequencies (blue) in the Mixtral $8\times7$b model, encompassing 32 MoE layers with 8 experts per layer. The top set of heatmaps is computed on the C4 dataset, and the bottom set is computed on the MATH dataset. MoE-LLMs selectively activate the top-2 experts in each MoE layer, and a significant portion of experts remain less important or inactive all the time.
  • Figure 4: Typical attention map of block 15, head 4 in Mixtral $8\times7$b under different dynamic pruning processes. The middle map, without pruning, shows that attention highlights several tokens with high scores in a column-wise manner, such as token 31 and token 67. However, after traditional weight-only pruning in the block 14 layers, the experts at the position of token 67 are pruned, resulting in decay in the attention map. With attention-aware pruning based on token importance, block 14 protects token 67, thereby avoiding attention decay in the subsequent layer.
  • Figure 5: Quantized PPL performance of Mixtral $8\times7$b under different mixed-precision strategies (with random allocation)
  • ...and 7 more figures
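
The pruning behavior described in the captions of Figure 2(b) and Figure 4 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes top-2 routing, that the second expert is dropped when its routing weight falls below a fixed fraction of the first expert's weight, and that a per-token importance score (for example, column-wise attention mass) is already available. The function names, the `ratio_threshold` value, and all numbers are hypothetical.

```python
import math
from typing import List, Sequence, Set, Tuple


def protected_tokens(importance: Sequence[float],
                     keep_fraction: float = 0.02) -> Set[int]:
    """Return the indices of the most important tokens (at least one)."""
    k = max(1, math.ceil(keep_fraction * len(importance)))
    ranked = sorted(range(len(importance)),
                    key=lambda t: importance[t], reverse=True)
    return set(ranked[:k])


def select_experts(router_weights: Sequence[float],
                   token_idx: int,
                   protected: Set[int],
                   ratio_threshold: float = 0.5) -> List[Tuple[int, float]]:
    """Pick the experts for one token under top-2 routing.

    The second expert is pruned when its weight is small relative to the
    first, unless the token is protected. Returns (expert_id, weight)
    pairs with weights renormalized to sum to 1.
    """
    ranked = sorted(range(len(router_weights)),
                    key=lambda e: router_weights[e], reverse=True)
    top1, top2 = ranked[0], ranked[1]
    w1, w2 = router_weights[top1], router_weights[top2]
    if token_idx not in protected and w2 < ratio_threshold * w1:
        return [(top1, 1.0)]  # low-confidence second expert is skipped
    total = w1 + w2
    return [(top1, w1 / total), (top2, w2 / total)]


# Made-up importance scores for 100 tokens; tokens 31 and 67 stand out.
scores = [0.1] * 100
scores[67] = 5.0
scores[31] = 3.0
keep = protected_tokens(scores, keep_fraction=0.02)

skewed = [0.05, 0.70, 0.10, 0.15]            # one dominant expert
print(select_experts(skewed, 5, keep))       # ordinary token: second expert pruned
print(select_experts(skewed, 67, keep))      # protected token: both experts kept
```

Protecting a token here simply bypasses the pruning test, so the important positions always receive both experts, which is the effect the Figure 4 caption attributes to attention-aware pruning.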