Collaborative Compression for Large-Scale MoE Deployment on Edge
Yixiao Chen, Yanyue Xie, Ruining Yang, Wei Jiang, Wei Wang, Yong He, Yue Chen, Pu Zhao, Yanzhi Wang
TL;DR
The paper tackles the memory bottleneck of ultra-large MoE models for edge deployment by proposing a collaborative compression framework that jointly employs Performance-Aware Expert Pruning, Hardware-Aware Activation Adjustment, and Mixed-Precision Quantization. The pipeline prunes underutilized experts, adapts activation routing, and performs tensor-level, sensitivity-guided quantization upgrades within a memory budget, aided by a dynamic budgeting and back-off scheme. On DeepSeek-V3 (671B parameters), the approach reduces storage from 1.3TB to 103GB and delivers higher accuracy than uniform low-bit quantization on multiple benchmarks, including BBH, MMLU, and GSM8K, with feasible edge-device latency. The compressed model fits on devices with around 128GB of memory, which makes edge deployment of ultra-large MoE models practical.
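The idea of sensitivity-guided quantization upgrades under a memory budget can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the `TensorInfo` type, the `assign_bits` function, the 2-bit base and 4-bit upgraded widths, the sensitivity scores, and the greedy ordering are all hypothetical. The sketch also does not model the paper's dynamic budgeting and back-off scheme or the storage overhead of quantization scales.

```python
from dataclasses import dataclass


@dataclass
class TensorInfo:
    """Hypothetical per-tensor record; not the paper's data structure."""
    name: str
    n_params: int
    sensitivity: float  # assumed score: higher means low precision hurts more


def assign_bits(tensors, budget_bytes, base_bits=2, upgraded_bits=4):
    """Greedy sketch: start every tensor at base_bits, then upgrade the most
    sensitive tensors to upgraded_bits while the total stays within budget."""
    bits = {t.name: base_bits for t in tensors}
    used = sum(t.n_params * base_bits / 8 for t in tensors)
    if used > budget_bytes:
        raise ValueError("budget is too small even at base precision")

    # Visit tensors from most to least sensitive.
    for t in sorted(tensors, key=lambda t: t.sensitivity, reverse=True):
        extra = t.n_params * (upgraded_bits - base_bits) / 8
        if used + extra <= budget_bytes:
            bits[t.name] = upgraded_bits
            used += extra
        # Otherwise the tensor stays at base precision and we try the next one.
    return bits, used


# Toy example with made-up tensor names, sizes, and scores.
tensors = [
    TensorInfo("attn.q_proj", n_params=1000, sensitivity=0.9),
    TensorInfo("expert_0.up_proj", n_params=4000, sensitivity=0.5),
    TensorInfo("expert_1.up_proj", n_params=1000, sensitivity=0.1),
]
bits, used = assign_bits(tensors, budget_bytes=2000)
```

In this toy run the large middle tensor does not fit at 4 bits, so it stays at 2 bits while the two smaller tensors are upgraded. A greedy pass of this kind is only one possible way to spend a fixed budget; the paper's actual selection rule may differ.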
Abstract
The Mixture of Experts (MoE) architecture is an important method for scaling Large Language Models (LLMs): it increases model capacity while keeping computation cost low. However, ultra-large MoE models still have hundreds of billions of parameters, which require massive memory and storage and make deployment on resource-constrained edge platforms difficult. Pruning or quantization alone can hardly address the issue, because reaching the required compression ratio with a single technique is so aggressive that accuracy and output quality degrade significantly. To facilitate the deployment of ultra-large MoEs on edge platforms, we propose a collaborative compression framework that combines expert pruning, mixed-precision quantization, and activation optimization. It reduces the storage footprint of the ultra-large MoE DeepSeek-V3 from 1.3TB to 103GB while preserving high output quality, with better accuracy than traditional uniform low-bit quantization methods. To the best of our knowledge, we are the first to deploy a compressed model derived from the ultra-large DeepSeek-V3 on a platform with a strict 128GB total memory limit. Comprehensive experiments on multiple benchmarks under various memory constraints demonstrate the effectiveness of our method, which yields smaller model sizes and higher accuracy than uniform low-bit quantization methods.
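A back-of-the-envelope calculation helps show why a single technique is unlikely to reach this target. The sketch below uses only the figures quoted above (671B parameters, 1.3TB, 103GB) and assumes decimal units (1TB = 10^12 bytes, 1GB = 10^9 bytes); the helper name `bits_per_param` is hypothetical, and the estimate ignores metadata such as quantization scales.

```python
def bits_per_param(size_bytes: float, n_params: float) -> float:
    """Average number of storage bits available per parameter."""
    return size_bytes * 8 / n_params


N_PARAMS = 671e9         # DeepSeek-V3 parameter count quoted in the text
ORIGINAL_BYTES = 1.3e12  # 1.3TB, assuming decimal units
TARGET_BYTES = 103e9     # 103GB, assuming decimal units

compression_ratio = ORIGINAL_BYTES / TARGET_BYTES
original_bits = bits_per_param(ORIGINAL_BYTES, N_PARAMS)
# Bits per parameter if every parameter were kept (no pruning at all).
target_bits_no_pruning = bits_per_param(TARGET_BYTES, N_PARAMS)

print(f"compression ratio: {compression_ratio:.1f}x")
print(f"original: {original_bits:.1f} bits/param")
print(f"target without pruning: {target_bits_no_pruning:.2f} bits/param")
```

Under these assumptions the target corresponds to roughly a 12.6x reduction, from about 15.5 bits per parameter to about 1.23 bits per parameter if no parameters are removed. That is below even uniform 2-bit quantization, which is consistent with the argument that pruning and quantization need to be combined.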
