SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts

Alexandre Muzio; Alex Sun; Churan He

TL;DR

SEER-MoE targets the memory and compute bottlenecks of Mixture-of-Experts models with a two-stage approach: first prune the total number of experts using heavy-hitters counting, then fine-tune with entropy-regularized Top-K adaptation to reduce the number of experts activated per token. The method reports parameter and FLOP reductions of roughly 25–27% with minimal accuracy loss, demonstrated on Mixtral 8x7b on the MMLU and SST5 benchmarks, and it outperforms prior pruning baselines. By combining count-based sparsification with regularized fine-tuning, the work offers practical guidance for resource-constrained inference and is intended to apply to other MoE architectures.
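The first stage can be illustrated with a minimal sketch. It assumes the simplest reading of heavy-hitters counting: tally how often the router selects each expert on a calibration set, then keep the most frequently selected experts in a layer. The function name `heavy_hitter_experts`, its arguments, and the tie-breaking rule are hypothetical choices for illustration, not the authors' implementation.

```python
from collections import Counter


def heavy_hitter_experts(routing_decisions, num_experts, num_keep):
    """Pick the experts to retain in one MoE layer by activation count.

    routing_decisions: iterable of tuples, one per token, holding the
        expert ids the router selected for that token (e.g. top-2).
    num_experts: total number of experts in the layer.
    num_keep: number of experts to retain after pruning.

    Returns (kept_ids, counts). Assumption: plain hard counts are used;
    a real pipeline might weight counts by router probability instead.
    """
    # Start every expert at zero so never-selected experts still appear.
    counts = Counter({e: 0 for e in range(num_experts)})
    for chosen in routing_decisions:
        counts.update(chosen)

    # Rank by count (descending); break ties by the lower expert id.
    ranked = sorted(range(num_experts), key=lambda e: (-counts[e], e))
    return sorted(ranked[:num_keep]), dict(counts)


# Toy calibration data: five tokens routed with top-2 over four experts.
decisions = [(0, 3), (0, 1), (3, 0), (2, 3), (0, 3)]
kept, counts = heavy_hitter_experts(decisions, num_experts=4, num_keep=2)
print(kept, counts)
```

In this toy example experts 0 and 3 dominate the routing decisions, so they are the ones retained; the remaining experts and their router columns would be dropped from the layer.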

Abstract

The advancement of deep learning has led to the emergence of Mixture-of-Experts (MoE) models, known for their dynamic allocation of computational resources based on input. Despite their promise, MoEs face challenges, particularly in terms of memory requirements. To address this, our work introduces SEER-MoE, a novel two-stage framework for reducing both the memory footprint and compute requirements of pre-trained MoE models. The first stage prunes the total number of experts, guided by heavy-hitters counting, while the second stage employs a regularization-based fine-tuning strategy to recover accuracy loss and reduce the number of activated experts during inference. Our empirical studies demonstrate the effectiveness of our method, resulting in a sparse MoE model optimized for inference efficiency with minimal accuracy trade-offs.
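The regularization-based second stage can be sketched in the same spirit. The example below assumes the regularizer is the Shannon entropy of the router's softmax distribution, averaged over tokens and added to the task loss with a scalar weight; lower entropy means probability mass concentrates on fewer experts, which is what makes a smaller top-K viable. The names `gate_entropy`, `regularized_loss`, and `reg_weight` are hypothetical, and the exact loss used in the paper may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [v / total for v in exps]


def gate_entropy(logits):
    """Shannon entropy (in nats) of the router distribution for one token."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def regularized_loss(task_loss, batch_router_logits, reg_weight):
    """Task loss plus reg_weight times the mean gate entropy.

    Assumption: a simple additive penalty; reg_weight is a
    hypothetical hyperparameter chosen for illustration.
    """
    mean_entropy = sum(gate_entropy(l) for l in batch_router_logits) / len(
        batch_router_logits
    )
    return task_loss + reg_weight * mean_entropy


# A uniform router is maximally uncertain; a peaked one is penalized less.
uniform = [0.0, 0.0, 0.0, 0.0]
peaked = [8.0, 0.0, 0.0, 0.0]
print(gate_entropy(uniform), gate_entropy(peaked))
print(regularized_loss(1.0, [uniform, peaked], reg_weight=0.1))
```

Minimizing this combined objective during fine-tuning pushes the router toward peaked distributions, so dropping from top-2 to a smaller K discards less probability mass.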

Paper Structure (21 sections, 8 equations, 4 figures, 5 tables)

Introduction
Related Work
Methodology
Parameter and Compute Scaling of MoE Transformers
Expert Sparsification with Heavy-hitters Counting
Enhancing Expert Efficiency: Advanced Finetuning Strategies
Top-K adaptation
Entropy-based gating regularization
SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts
Experiments
Experimental details
Data
Evaluation Method
Results and Analysis
Expert Sparsification with Heavy-hitters Counting
...and 6 more sections

Figures (4)

Figure 1: SEER-MoE visualized in a two-stage process. (a) The initial model with all experts and top-2 router. (b) Stage 1 involves expert pruning based on heavy-hitters counting to identify and retain the most critical experts; Stage 2 includes top-K adaptation through fine-tuning to optimize the number of active experts, culminating in a model that balances efficiency and performance. (c) SEER-MoE with expert pruning and top-K adaptation.
Figure 2: Heavy Hitters Counting Heatmap with Mixtral 8x7b on MMLU.
Figure 3: Comparison of fine-tuning strategies and their impact on SST5 accuracy.
Figure 4: Full stage approach results on SST5.
