ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration
Mengting Ai, Tianxin Wei, Yifan Chen, Zhichen Zeng, Ritchie Zhao, Girish Varatkar, Bita Darvish Rouhani, Xianfeng Tang, Hanghang Tong, Jingrui He
TL;DR
This paper tackles the memory bottleneck of sparse MoE Transformers with ResMoE, a one-shot, data-agnostic compression framework that uses a Wasserstein barycenter to derive a shared barycenter expert and then restores each original expert from a compressed residual. Expert weights are aligned through permutation matrices, and the residuals are compressed with unstructured pruning or SVD. ResMoE reduces the parameters in an expert by up to about 75% with minimal accuracy degradation on backbones such as Switch Transformer, Mixtral, and DeepSeekMoE, without retraining. The method is grounded in optimal transport, with a proposition linking the barycenter solution to an equivalent Frobenius-norm objective, and is validated empirically on NLU and NLG benchmarks, including GLUE and zero-shot generation tasks. By making inference more memory-efficient, the approach broadens access to large MoE LLMs, and the framework leaves room to pair barycenter extraction with different residual compressors and with potential hardware-aware optimizations.
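To make the Frobenius-norm objective mentioned above concrete, the block below writes out one plausible form of it. The notation is an assumption made here for illustration, not taken from the paper: $W_i$ stacks the hidden neurons of expert $i$ as rows, $T_i$ is a permutation matrix from the set $\mathcal{P}_h$, $\bar{W}$ is the barycenter expert, and $\operatorname{compress}$ stands for either unstructured pruning or a truncated SVD.

```latex
% Assumed notation (illustrative only):
%   W_i \in \mathbb{R}^{h \times d}  rows are the h hidden neurons of expert i
%   T_i \in \mathcal{P}_h            an h x h permutation matrix
%   \bar{W}                          the shared barycenter expert
\min_{\bar{W},\; T_1, \dots, T_N \in \mathcal{P}_h}
  \; \frac{1}{N} \sum_{i=1}^{N} \bigl\lVert T_i W_i - \bar{W} \bigr\rVert_F^2 ,
\qquad
W_i \approx T_i^{\top} \bigl( \bar{W} + \tilde{\Delta}_i \bigr),
\quad
\tilde{\Delta}_i = \operatorname{compress}\bigl( T_i W_i - \bar{W} \bigr).
```

Under this form, for fixed permutations the minimizing $\bar{W}$ is simply the average of the aligned experts $T_i W_i$, which is why alignment and averaging can be alternated.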
Abstract
The Mixture-of-Experts (MoE) Transformer, the backbone architecture of multiple phenomenal language models, leverages sparsity by activating only a fraction of the model parameters for each input token. This sparse structure keeps the time cost constant but is space-inefficient: all model parameters must still be loaded during inference. We introduce ResMoE, an innovative MoE approximation framework that uses a Wasserstein barycenter to extract a common expert (the barycenter expert) and approximates the residuals between this barycenter expert and the original experts. ResMoE improves the space efficiency of inference for large-scale MoE Transformers in a one-shot, data-agnostic manner without retraining while incurring minimal accuracy loss, thereby paving the way for broader accessibility to large language models. We demonstrate the effectiveness of ResMoE through extensive experiments on Switch Transformer, Mixtral, and DeepSeekMoE models. The results show that ResMoE can reduce the number of parameters in an expert by up to 75% while maintaining comparable performance. The code is available at https://github.com/iDEA-iSAIL-Lab-UIUC/ResMoE.
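The extract-then-restore idea described above can be illustrated with a toy NumPy sketch. This is not the authors' implementation (see the linked repository for that). It assumes each expert is given as a single matrix whose rows are hidden neurons, it aligns rows by brute-force search over permutations (feasible only for tiny sizes; a real system would use an assignment or optimal-transport solver), and it compresses residuals with magnitude pruning. All function names are hypothetical.

```python
import itertools
import numpy as np


def align_to_reference(w, ref):
    """Row permutation p minimizing ||w[p] - ref||_F (brute force, toy sizes only)."""
    n = w.shape[0]
    best = min(
        itertools.permutations(range(n)),
        key=lambda p: np.sum((w[list(p)] - ref) ** 2),
    )
    return np.array(best)


def magnitude_prune(m, keep_ratio):
    """Unstructured pruning: keep the largest-magnitude entries, zero the rest."""
    k = max(1, int(round(keep_ratio * m.size)))
    thresh = np.sort(np.abs(m), axis=None)[-k]  # k-th largest magnitude
    return np.where(np.abs(m) >= thresh, m, 0.0)


def compress_experts(experts, keep_ratio=0.25, n_iters=5):
    """Alternate alignment and averaging, then prune each expert's residual."""
    center = experts[0].copy()  # assumption: initialize from the first expert
    for _ in range(n_iters):
        perms = [align_to_reference(w, center) for w in experts]
        center = np.mean([w[p] for w, p in zip(experts, perms)], axis=0)
    # Final alignment against the final center, so residuals match it exactly.
    perms = [align_to_reference(w, center) for w in experts]
    residuals = [
        magnitude_prune(w[p] - center, keep_ratio) for w, p in zip(experts, perms)
    ]
    return center, perms, residuals


def restore(center, perm, residual):
    """Rebuild an approximate expert and undo its row permutation."""
    aligned = center + residual
    out = np.empty_like(aligned)
    out[perm] = aligned  # aligned[j] came from original row perm[j]
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.normal(size=(4, 6))
    # Toy experts: shuffled, slightly perturbed copies of one base matrix.
    experts = [
        base[rng.permutation(4)] + 0.01 * rng.normal(size=(4, 6)) for _ in range(3)
    ]
    center, perms, residuals = compress_experts(experts, keep_ratio=0.25)
    for w, p, r in zip(experts, perms, residuals):
        err = np.linalg.norm(restore(center, p, r) - w)
        print(f"kept {np.count_nonzero(r)} of {r.size} residual entries, error {err:.4f}")
```

In this sketch the shared `center` is stored once, and each expert keeps only a sparse residual plus its permutation. Because permuting the hidden neurons of a feed-forward expert does not change its function, the un-permutation step in `restore` is included mainly so the result can be compared entry by entry with the original.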
