ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration

Mengting Ai, Tianxin Wei, Yifan Chen, Zhichen Zeng, Ritchie Zhao, Girish Varatkar, Bita Darvish Rouhani, Xianfeng Tang, Hanghang Tong, Jingrui He

TL;DR

This paper tackles the memory bottleneck of sparse MoE Transformers by introducing ResMoE, a one-shot, data-agnostic compression framework that uses a Wasserstein barycenter to derive a common barycenter expert and then restores each original expert from compressed residuals. By aligning expert weights through permutation matrices and compressing the residuals with unstructured pruning or SVD, ResMoE reduces the number of parameters in an expert by up to 75% with minimal accuracy degradation across backbones such as Switch Transformer, Mixtral, and DeepSeekMoE. The method is grounded in optimal transport, with a proposition showing that the barycenter solution also solves an equivalent Frobenius-norm objective, and it is validated empirically on NLU and NLG benchmarks, including GLUE and zero-shot generation tasks. Because no retraining is required, the approach broadens access to large MoE LLMs, and the framework can pair barycenter extraction with different residual compressors and potential hardware-aware optimizations.

Abstract

The Mixture-of-Experts (MoE) Transformer, the backbone architecture of multiple phenomenal language models, leverages sparsity by activating only a fraction of model parameters for each input token. The sparse structure, while allowing constant time costs, results in space inefficiency: we still need to load all the model parameters during inference. We introduce ResMoE, an innovative MoE approximation framework that uses the Wasserstein barycenter to extract a common expert (the barycenter expert) and approximates the residuals between this barycenter expert and the original ones. ResMoE enhances the space efficiency of inference for large-scale MoE Transformers in a one-shot and data-agnostic manner without retraining while maintaining minimal accuracy loss, thereby paving the way for broader accessibility to large language models. We demonstrate the effectiveness of ResMoE through extensive experiments on Switch Transformer, Mixtral, and DeepSeekMoE models. The results show that ResMoE can reduce the number of parameters in an expert by up to 75% while maintaining comparable performance. The code is available at https://github.com/iDEA-iSAIL-Lab-UIUC/ResMoE.
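
The residual idea described in the abstract can be illustrated with a minimal pure-Python sketch. It rests on several simplifying assumptions that are not taken from the paper: the experts are treated as already aligned, so an element-wise mean stands in for the barycenter expert; residuals are compressed by magnitude pruning with a 25% keep ratio; and the helper names (`mean_expert`, `combine`, `prune_by_magnitude`, `sq_error`) and the toy data are hypothetical.

```python
import random

def mean_expert(experts):
    """Element-wise mean of equally shaped matrices (lists of rows).
    Assumption: a stand-in for the barycenter expert when experts are pre-aligned."""
    n = len(experts)
    rows, cols = len(experts[0]), len(experts[0][0])
    return [[sum(e[i][j] for e in experts) / n for j in range(cols)]
            for i in range(rows)]

def combine(a, b, sign=1.0):
    """Return a + sign * b element-wise."""
    return [[x + sign * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def prune_by_magnitude(m, keep_ratio):
    """Keep the largest-magnitude entries of m and zero out the rest."""
    cols = len(m[0])
    flat = [v for row in m for v in row]
    k = int(len(flat) * keep_ratio)
    order = sorted(range(len(flat)), key=lambda i: abs(flat[i]), reverse=True)
    keep = set(order[:k])
    pruned = [v if i in keep else 0.0 for i, v in enumerate(flat)]
    return [pruned[r * cols:(r + 1) * cols] for r in range(len(m))]

def sq_error(a, b):
    """Squared Frobenius distance between two matrices."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Toy data (hypothetical): experts are a shared base plus small perturbations.
random.seed(0)
ROWS, COLS, NUM_EXPERTS, KEEP = 8, 8, 4, 0.25
base = [[random.uniform(-1, 1) for _ in range(COLS)] for _ in range(ROWS)]
experts = [[[base[i][j] + random.gauss(0, 0.05) for j in range(COLS)]
            for i in range(ROWS)] for _ in range(NUM_EXPERTS)]

center = mean_expert(experts)
# Compress only the residuals, then restore each expert as center + residual.
residuals = [prune_by_magnitude(combine(w, center, -1.0), KEEP) for w in experts]
restored = [combine(center, r) for r in residuals]

err_residual = sum(sq_error(w, r) for w, r in zip(experts, restored))
err_direct = sum(sq_error(w, prune_by_magnitude(w, KEEP)) for w in experts)
print(f"residual pruning error: {err_residual:.4f}")
print(f"direct pruning error:   {err_direct:.4f}")
```

On this toy data the experts share most of their structure, so pruning the small residuals loses far less than pruning the experts directly. Real experts may be less similar, and the sketch does not account for the cost of storing the shared center.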

Paper Structure

This paper contains 31 sections, 2 theorems, 20 equations, 4 figures, 12 tables, 2 algorithms.

Key Result

Proposition 4.0

Consider the solution $\mathbf{W}_\omega$ to the free-support Wasserstein barycenter (WB) problem. Then $\mathbf{W}_\omega$, along with $\mathbf{T}_k = p_{\operatorname{I}} \cdot \operatorname{OT}\left(\mu_k, \mu_\omega(\mathbf{W}_\omega)\right)$, is the solution to the Frobenius-norm optimization problem (eqn:min_frob).
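
The role of $\mathbf{T}_k$ can be illustrated with a small sketch. For two uniform discrete measures with the same number of atoms, an optimal transport plan can be chosen as a permutation matrix scaled by one over the number of atoms, so multiplying by that number gives a permutation matrix. The code assumes, without the source reproducing the equation, that the Frobenius objective compares $\mathbf{W}_\omega$ with a row-permuted expert. It brute-forces the best row permutation on a tiny hypothetical example rather than calling an optimal transport solver, and all function names are illustrative.

```python
from itertools import permutations

def frob_sq(a, b):
    """Squared Frobenius distance between two matrices (lists of rows)."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def best_row_permutation(reference, expert):
    """Brute-force the row order of `expert` closest to `reference`.
    Only feasible for tiny matrices; a stand-in for an OT solver."""
    best = min(permutations(range(len(expert))),
               key=lambda p: frob_sq(reference, [expert[i] for i in p]))
    return list(best)

def permutation_matrix(perm):
    """Matrix T with T[r][perm[r]] = 1, so that T @ expert reorders its rows."""
    n = len(perm)
    return [[1 if perm[r] == c else 0 for c in range(n)] for r in range(n)]

# Hypothetical data: `expert` is `reference` with its rows shuffled.
reference = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [-1.0, 4.0]]
shuffle = [2, 0, 3, 1]
expert = [reference[i] for i in shuffle]

perm = best_row_permutation(reference, expert)
aligned = [expert[i] for i in perm]
T = permutation_matrix(perm)

print("recovered permutation:", perm)
print("distance after alignment:", frob_sq(reference, aligned))
```

Because the rows of the toy reference are distinct, the recovered permutation is the inverse of the shuffle and the aligned expert matches the reference exactly. At realistic sizes the search over permutations would be replaced by an assignment or optimal transport solver.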

Figures (4)

  • Figure 1: In this illustrative example of MoE layers, the Top-K Selector, along with the Gate Network (often referred to as the 'router'), selects Experts 1 and 3 based on their scores for the given input. Figure taken from ai2025mlpfusionefficientfinetuning.
  • Figure 2: The overall framework of ResMoE. We introduce permutation matrices $\mathbf{T}$ to obtain the barycenter expert $\mathbf{W}_\omega$ from a distributional view. Instead of compressing the original experts directly, we opt to compress the residual matrices ($\mathbf{\Delta}$, illustrated with lighter colors) between each expert and the barycenter expert, with the capability to dynamically and efficiently restore the original matrices during inference. We illustrate the concept using unstructured pruning as an example, with dashed orange lines indicating the pruned connections within the network.
  • Figure 3: Comparisons between ResMoE and baselines. Dashed lines denote deleted connections or neurons. Expert Merging reduces the number of experts by consolidating several into one, while pruning is applied directly to the experts. In contrast, ResMoE compresses the residuals and the barycenter expert, and the input x is directed to the restored experts.
  • Figure 4: Performance of selected baseline methods on Mixtral w.r.t. various compression rates on the LAMBADA dataset. Note that MEO and Git Re-Basin must retain at least one merged expert, so they cannot reach the 10% compression rate.

Theorems & Definitions (2)

  • Proposition 4.0
  • Proposition C.0