SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts
Alexandre Muzio, Alex Sun, Churan He
TL;DR
SEER-MoE targets the memory and compute bottlenecks of Mixture-of-Experts (MoE) models with a two-stage approach: first prune the total number of experts, guided by heavy-hitters counting, then fine-tune with entropy-regularized Top-K adaptation to further reduce the number of activated experts. The method reports reductions in parameters and FLOPs (e.g., roughly 25–27%) with minimal accuracy loss, demonstrated on Mixtral 8x7B on the MMLU and SST5 benchmarks, and it outperforms prior pruning baselines. By combining principled sparsification with regularized fine-tuning, the work offers practical guidance for resource-constrained inference and may carry over to other MoE architectures.
Abstract
The advancement of deep learning has led to the emergence of Mixture-of-Experts (MoE) models, known for their dynamic allocation of computational resources based on input. Despite their promise, MoEs face challenges, particularly in terms of memory requirements. To address this, our work introduces SEER-MoE, a novel two-stage framework for reducing both the memory footprint and compute requirements of pre-trained MoE models. The first stage prunes the total number of experts, guided by heavy-hitters counting, while the second stage employs a regularization-based fine-tuning strategy to recover the lost accuracy and reduce the number of experts activated during inference. Our empirical studies demonstrate the effectiveness of our method, resulting in a sparse MoE model optimized for inference efficiency with minimal accuracy trade-offs.
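
The two stages described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation and rests on several assumptions: that heavy-hitters counting means tallying how often each expert appears in a token's Top-K routing choice on calibration data (a hard count), that the least-used experts are the ones pruned, and that the regularizer is the mean Shannon entropy of the softmax routing distribution added to the task loss. The function names and the weight `lam` are hypothetical.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the expert dimension.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def heavy_hitter_counts(router_logits, top_k):
    # router_logits: array of shape (num_tokens, num_experts).
    # Assumption: a "hit" is an expert appearing in a token's top-k set.
    num_experts = router_logits.shape[-1]
    top_idx = np.argpartition(router_logits, -top_k, axis=-1)[:, -top_k:]
    return np.bincount(top_idx.ravel(), minlength=num_experts)

def experts_to_keep(counts, num_keep):
    # Stage 1 (assumed): keep the num_keep most frequently selected experts.
    order = np.argsort(-counts, kind="stable")
    return np.sort(order[:num_keep])

def router_entropy(router_logits):
    # Mean Shannon entropy (in nats) of the per-token routing distribution.
    probs = softmax(router_logits)
    return float(-(probs * np.log(probs + 1e-12)).sum(axis=-1).mean())

def regularized_loss(task_loss, router_logits, lam=0.1):
    # Stage 2 (assumed): penalizing entropy pushes routing mass onto fewer
    # experts, so a smaller top-k loses less at inference time.
    return task_loss + lam * router_entropy(router_logits)

# Toy calibration batch: 4 tokens routed over 4 experts with top-2 routing.
logits = np.array([
    [3.0, 2.0, 0.0, -1.0],
    [3.0, 0.0, 2.0, -1.0],
    [2.0, 3.0, 0.0, -1.0],
    [3.0, 2.0, 1.0, -5.0],
])
counts = heavy_hitter_counts(logits, top_k=2)
kept = experts_to_keep(counts, num_keep=2)
```

In this toy batch, expert 3 is never selected and expert 2 is selected once, so they would be the first candidates for removal. In a real model the counts would be gathered per MoE layer over a calibration dataset, and the entropy term would be computed inside the training framework so that gradients reach the router.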
