MC#: Mixture Compressor for Mixture-of-Experts Large Models

Wei Huang, Yue Liao, Yukang Chen, Jianhui Liu, Haoru Tan, Si Liu, Shiming Zhang, Shuicheng Yan, Xiaojuan Qi

TL;DR

This work tackles the memory and computation bottlenecks of Mixture-of-Experts architectures by introducing MC#, a two-stage compression framework that combines Pre-Loading Mixed-Precision Quantization (PMQ) with Online Top-any Pruning (OTP). PMQ assigns per-expert bit-widths by solving a linear program over expert significance and quantization error, enabling ultra-low-bit static compression, while OTP uses a differentiable, Gumbel-Softmax-based mechanism to prune experts dynamically per token, reducing runtime cost. The approach targets a Pareto-optimal trade-off between size and performance: on DeepSeek-VL2 it achieves up to a $6.2\times$ weight reduction at an average of $2.57$ bits with only a $1.7\%$ accuracy loss across multimodal benchmarks, and it reduces expert activation by more than $20\%$ with less than $1\%$ performance degradation. Together, PMQ and OTP yield highly compressed MoE-based large models that can still outperform equal-sized, full-precision baselines on several benchmarks, highlighting practical potential for efficient deployment on diverse hardware.
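The bit-allocation idea behind PMQ can be illustrated with a toy example. The sketch below is not the authors' formulation: it assumes a cost of the form (expert importance) x (quantization error at the chosen bit-width) and an average-bit budget, and it swaps the LP solver for brute-force enumeration, which is only feasible for a handful of experts. The names `allocate_bits`, `importance`, and `quant_error`, and all numeric values, are hypothetical.

```python
from itertools import product

def allocate_bits(importance, quant_error, bit_choices, avg_bits):
    """Choose one bit-width per expert under an average-bit budget.

    Minimizes sum_i importance[i] * quant_error[i][b_i]
    subject to sum_i b_i <= avg_bits * num_experts.
    Brute force stands in for the LP solver here (assumption).
    """
    num_experts = len(importance)
    budget = avg_bits * num_experts
    best_cost, best_assign = float("inf"), None
    for assign in product(bit_choices, repeat=num_experts):
        if sum(assign) > budget:
            continue  # violates the storage budget
        cost = sum(importance[i] * quant_error[i][b]
                   for i, b in enumerate(assign))
        if cost < best_cost:
            best_cost, best_assign = cost, assign
    return best_assign, best_cost

# Hypothetical calibration statistics for four experts.
importance = [0.5, 0.3, 0.15, 0.05]            # e.g. normalized routing weight
error_by_bits = {1: 4.0, 2: 1.0, 3: 0.25}      # error shrinks as bits grow
quant_error = [error_by_bits] * len(importance)

assignment, cost = allocate_bits(importance, quant_error,
                                 bit_choices=(1, 2, 3), avg_bits=2.0)
print(assignment, round(cost, 3))
```

With these made-up numbers the most important expert receives 3 bits and the least important one drops to 1 bit, while the average stays at 2 bits. Uniform 2-bit quantization would cost more under the same budget.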

Abstract

Mixture-of-Experts (MoE) effectively scales large language models (LLMs) and vision-language models (VLMs) by increasing capacity through sparse activation. However, preloading all experts into memory and activating multiple experts per input introduces significant computational and memory overhead, making the expert module a major contributor to model size and inference cost. To address this, we propose MC# (Mixture-Compressor-sharp), a framework that combines static quantization and dynamic expert pruning by leveraging the significance of experts and tokens for aggressive compression of MoE-LLMs/VLMs. To reduce storage and loading costs, we introduce Pre-Loading Mixed-Precision Quantization (PMQ), which optimizes bit allocation via linear programming, balancing expert importance and quantization error for a Pareto-optimal trade-off between size and performance. To reduce runtime computation, Online Top-any Pruning (OTP) uses Gumbel-Softmax sampling to dynamically select a subset of experts per token, enabling fine-grained control over activation. By combining PMQ's static bit-width optimization with OTP's dynamic routing, MC# achieves extreme compression with minimal accuracy loss. On DeepSeek-VL2, MC# achieves a $6.2\times$ weight reduction at 2.57 average bits with only a 1.7% accuracy drop across five multimodal benchmarks. Additionally, OTP reduces expert activation by over 20% with less than 1% performance degradation, demonstrating strong potential for efficient MoE-based model deployment.
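To make the per-token expert selection concrete, here is a forward-only sketch of a Gumbel-Softmax keep/drop decision. The abstract does not say how OTP parameterizes its decisions, so everything here is an assumption: each routed expert gets a hypothetical `[keep, drop]` logit pair, at least one expert is always kept, and the surviving router weights are renormalized. The function names are illustrative only.

```python
import math
import random

def gumbel_softmax(logits, tau, rng):
    """Return a relaxed one-hot sample over `logits` at temperature `tau`."""
    noisy = []
    for logit in logits:
        # Clamp the uniform draw so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)                      # subtract max for stability
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def top_any_route(router_weights, keep_drop_logits, tau=1.0, seed=0):
    """Sample a keep/drop mask over one token's routed experts.

    `keep_drop_logits[i]` is a hypothetical [keep, drop] pair for expert i.
    """
    rng = random.Random(seed)
    mask = []
    for logits in keep_drop_logits:
        keep_prob, drop_prob = gumbel_softmax(logits, tau, rng)
        # Hard decision; a trainable version would typically pass
        # gradients through the soft probabilities (straight-through).
        mask.append(1 if keep_prob >= drop_prob else 0)
    if not any(mask):
        mask[0] = 1  # assumption: never drop every expert
    kept = sum(w * m for w, m in zip(router_weights, mask))
    weights = [w * m / kept for w, m in zip(router_weights, mask)]
    return mask, weights

# Two routed experts; the logits strongly favor keeping only the first.
mask, weights = top_any_route([0.6, 0.4], [[50.0, -50.0], [-50.0, 50.0]])
print(mask, weights)
```

Because the number of kept experts is sampled per token rather than fixed, the active expert count can fall below the router's top-k, which is where the runtime saving would come from.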

Paper Structure

This paper contains 28 sections, 14 equations, 13 figures, 8 tables.

Figures (13)

  • Figure 1: Comparison of total parameter size and inference activated parameter size on a few open-source large vision/language models and compressed Mixtral 8$\times$7b (MoE-LLMs) and DeepSeek-VL2-L (MoE-VLMs). L: large.
  • Figure 2: (a) MMLU (5-shot$\uparrow$) accuracy across different open-source LLMs with various activated parameters (dotted lines denote quantized models; solid lines denote 16-bit models). To align the parameter size of quantized models with that of 16-bit models, we define 16 bits as one standard parameter (e.g., 8$\times$2-bit elements represent one parameter). (b) Average performance ($\uparrow$) on 5 general multimodal benchmarks across different open-source VLMs with various activated parameters. L: large, S: small, T: tiny.
  • Figure 3: Overview of our proposed MC pipeline with two-stage compression for experts. (a) Framework of pre-loading static mixed-precision quantization (PMQ) of MoE-LLMs. PMQ determines the activated features and loss sensitivity of all experts and plans the optimal precision configuration at ultra-low bit-widths. (b) Schematic of online top-any pruning (OTP) of MoE-LLMs. OTP uses a learnable expert-pruning scheme to achieve higher inference efficiency.
  • Figure 4: Distribution of expert drop F-norm (red), activated weights (green), and frequencies (blue) in the Mixtral $8\times7$b model, encompassing 32 MoE layers with 8 experts per layer. The top set of heatmaps is computed on the C4 dataset (Raffel et al., 2020), and the bottom set on the MATH dataset. MoE-LLMs selectively activate the top-2 experts in each MoE layer, so a significant portion of experts remain less important or inactive throughout.
  • Figure 5: Comparison of expert quantization loss and activations between MoE-LLM and MoE-VLM. The left panel shows the quantization loss and the distribution of expert activation features for Mixtral $8\times7$b calibrated on a subset of the C4 dataset (Raffel et al., 2020), while the right panel shows the corresponding metrics for DeepSeek-VL2-S calibrated on a subset of the M4 dataset (Li et al., 2024). The expert indices are arranged clockwise, covering experts 0-8 and 0-64, respectively. Notably, the quantization loss and activation feature distributions across experts are significantly more imbalanced in MoE-VLMs than in MoE-LLMs.
  • ...and 8 more figures
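
The "standard parameter" convention in the Figure 2 caption (16 bits count as one parameter) can be checked with a few lines of arithmetic. The helper names below are hypothetical. The compression-ratio calculation assumes a 16-bit baseline and ignores any quantization metadata such as scales or zero points, so it is only a rough consistency check against the reported figures.

```python
def standard_params(num_elements, bits_per_element, reference_bits=16):
    """Count quantized elements in units of one 16-bit parameter."""
    return num_elements * bits_per_element / reference_bits

def compression_ratio(avg_bits, reference_bits=16):
    """Weight-size reduction relative to a 16-bit baseline (assumption)."""
    return reference_bits / avg_bits

# Eight 2-bit elements occupy the same space as one 16-bit parameter.
one_param = standard_params(8, 2)

# An average of 2.57 bits against a 16-bit baseline.
ratio = compression_ratio(2.57)
print(one_param, round(ratio, 1))
```

Under these assumptions, an average of 2.57 bits corresponds to roughly a 6.2-fold reduction, which agrees with the weight reduction quoted in the abstract.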