CAMERA: Multi-Matrix Joint Compression for MoE Models via Micro-Expert Redundancy Analysis

Yuzhuang Xu, Xu Han, Yuanchi Zhang, Yixuan Wang, Yijun Liu, Shiyu Ji, Qingfu Zhu, Wanxiang Che

TL;DR

MoE-based LLMs incur substantial compute and storage overhead, and their performance gains do not scale proportionally with the growth in expert parameters. The paper reframes MoE layers as mixtures of micro-experts that span the up, gate, and down projections, and introduces CAMERA, a lightweight, training-free framework for identifying micro-expert redundancy. Building on it, CAMERA-P performs structured pruning across matrices and CAMERA-Q applies micro-expert-aware mixed-precision quantization. Across nine downstream tasks, CAMERA-P outperforms strong baselines at pruning ratios from 20% to 60%, and CAMERA-Q gives superior results under aggressive 2-bit quantization. The full micro-expert analysis of Qwen2-57B-A14B takes less than 5 minutes on a single NVIDIA A100-40GB GPU.

Abstract

Large Language Models (LLMs) with Mixture-of-Experts (MoE) architectures are distinguished by their strong performance scaling with increasing parameters across a wide range of tasks, yet they also suffer from substantial computational and storage overheads. Notably, the performance gains of MoE models do not scale proportionally with the growth in expert parameters. While prior works attempt to reduce parameters via expert-level pruning, merging, or decomposition, they still suffer from challenges in both performance and computational efficiency. In this paper, we address these challenges by introducing micro-expert as a finer-grained compression unit that spans across matrices. We first establish a more fundamental perspective, viewing MoE layers as mixtures of micro-experts, and present CAMERA, a lightweight and training-free framework for identifying micro-expert redundancy. Our analysis uncovers significant variance in micro-expert contributions during decoding. Based on this insight, we further propose CAMERA-P, a structured micro-expert pruning framework, and CAMERA-Q, a mixed-precision quantization idea designed for micro-experts. Extensive experiments on nine downstream tasks show that CAMERA-P consistently outperforms strong baselines under pruning ratios ranging from 20% to 60%. Furthermore, CAMERA-Q achieves superior results under aggressive 2-bit quantization, surpassing existing matrix- and channel-level ideas. Notably, our method enables complete micro-expert analysis of Qwen2-57B-A14B in less than 5 minutes on a single NVIDIA A100-40GB GPU.
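
The micro-expert view described above can be illustrated with a small sketch. It assumes a SiLU-gated feed-forward expert, in which micro-expert i consists of row i of the gate matrix, row i of the up matrix, and column i of the down matrix, so that the expert output is the sum of the micro-expert contributions. All function names, the toy dimensions, and the `keep` indices are hypothetical and chosen for illustration; in particular, the retained set here is arbitrary and does not reproduce CAMERA's redundancy ranking.

```python
import math
import random


def silu(z):
    """SiLU activation, assumed here as the gate nonlinearity."""
    return z / (1.0 + math.exp(-z))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def micro_expert(x, gate_row, up_row, down_col):
    """One micro-expert: a scalar activation scaling one down-projection column."""
    h = silu(dot(gate_row, x)) * dot(up_row, x)
    return [h * d for d in down_col]


def expert_as_micro_experts(x, w_gate, w_up, w_down_cols, keep=None):
    """Expert output as a sum over micro-experts; `keep` selects retained indices."""
    indices = range(len(w_up)) if keep is None else keep
    out = [0.0] * len(x)
    for i in indices:
        contrib = micro_expert(x, w_gate[i], w_up[i], w_down_cols[i])
        out = [o + c for o, c in zip(out, contrib)]
    return out


def expert_dense(x, w_gate, w_up, w_down_cols):
    """Conventional form: W_down applied to silu(W_gate x) * (W_up x)."""
    h = [silu(dot(g, x)) * dot(u, x) for g, u in zip(w_gate, w_up)]
    return [sum(h_i * col[j] for h_i, col in zip(h, w_down_cols))
            for j in range(len(x))]


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def close(a, b, tol=1e-9):
    return all(abs(p - q) < tol for p, q in zip(a, b))


rng = random.Random(0)
d_model, d_ff = 4, 6                        # toy sizes, not from the paper
x = [rng.uniform(-1.0, 1.0) for _ in range(d_model)]
w_gate = rand_matrix(rng, d_ff, d_model)    # one row per micro-expert
w_up = rand_matrix(rng, d_ff, d_model)      # one row per micro-expert
w_down_cols = rand_matrix(rng, d_ff, d_model)  # stored column-wise: one column per micro-expert

full_micro = expert_as_micro_experts(x, w_gate, w_up, w_down_cols)
full_dense = expert_dense(x, w_gate, w_up, w_down_cols)

# Structured pruning across matrices: dropping micro-expert i removes row i of
# the gate and up matrices and column i of the down matrix together.
keep = [0, 2, 5]                            # arbitrary choice, NOT CAMERA's ranking
pruned_micro = expert_as_micro_experts(x, w_gate, w_up, w_down_cols, keep=keep)
pruned_dense = expert_dense(
    x,
    [w_gate[i] for i in keep],
    [w_up[i] for i in keep],
    [w_down_cols[i] for i in keep],
)
```

In this sketch, `full_micro` agrees with `full_dense`, and `pruned_micro` agrees with `pruned_dense`. This is why removing a micro-expert yields smaller dense matrices rather than a sparse mask.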

Paper Structure

This paper contains 46 sections, 30 equations, 9 figures, 8 tables, 3 algorithms.

Figures (9)

  • Figure 1: Transition from Experts to Micro-Experts. The lower part illustrates the structure of the mixture of micro-experts and the corresponding pruning strategy.
  • Figure 2: Distribution of micro-experts within each expert based on global ranking from Camera ($\lambda=20\%$), layer 12 of Deepseek-MoE-16B. We list all 66 experts, where 'S0/S1' denotes the shared experts, and the rest are non-shared experts.
  • Figure 3: Pruning ratios across selected experts, taken from layer 12 of Deepseek-MoE-16B, with $\lambda = 40\%$.
  • Figure 4: Task performance with varying $\alpha$ when $\lambda=20\%$. The scores are scaled to highlight the differences.
  • Figure 5: Matrix calculation flow of Camera-P, Camera-Q and Camera-Q$^\dagger$. For simplicity, we omit the matrix $\mathbf{W^\mathrm{gate}}$. The red dashed box on the left indicates the weight of a micro-expert. In Camera-Q and Camera-Q$^\dagger$, we use light and dark colors to indicate the lower and higher bit-width of the weights.
  • ...and 4 more figures