Finding Fantastic Experts in MoEs: A Unified Study for Expert Dropping Strategies and Observations

Ajay Jaiswal, Jianyu Wang, Yixiao Li, Pingzhi Li, Tianlong Chen, Zhangyang Wang, Chong Wang, Ruoming Pang, Xianzhi Du

TL;DR

The paper addresses the memory and redundancy challenges of sparsely activated Mixture-of-Experts models (SMoEs) by proposing MC-Suite, a diverse set of criteria that gauge expert importance from weight, activation, gradient, and inference signals. It advocates an iterative estimate-prune-finetune approach called MoE Lottery Subnetworks, which uses task-agnostic fine-tuning to stabilize load distribution and recover performance after pruning. A key finding is that activation and gradient entropy-based criteria most effectively identify the least dominant experts, and that instruction-following capabilities, while hurt by dropping, can be restored with external demonstrations or supervised fine-tuning. The results show that substantial sparsity (e.g., 50–75%) can be achieved with modest performance loss on knowledge-intensive tasks when coupled with few-shot or supervised instruction-following augmentation, pointing to practical ways of deploying larger SMoEs under tight resource constraints.

Abstract

Sparsely activated Mixture-of-Experts (SMoE) has shown promise in scaling up the learning capacity of neural networks. However, vanilla SMoEs have issues such as expert redundancy and heavy memory requirements, making them inefficient and non-scalable, especially for resource-constrained scenarios. Expert-level sparsification of SMoEs involves pruning the least important experts to address these limitations. In this work, we aim to address three questions: (1) What is the best recipe to identify the least knowledgeable subset of experts that can be dropped with minimal impact on performance? (2) How should we perform expert dropping (one-shot or iterative), and what correction measures can we undertake to minimize its drastic impact on SMoE subnetwork capabilities? (3) What capabilities of full-SMoEs are severely impacted by the removal of the least dominant experts, and how can we recover them? Firstly, we propose MoE Experts Compression Suite (MC-Suite), which is a collection of some previously explored and multiple novel recipes to provide a comprehensive benchmark for estimating expert importance from diverse perspectives, as well as unveil numerous valuable insights for SMoE experts. Secondly, unlike prior works with a one-shot expert pruning approach, we explore the benefits of iterative pruning with the re-estimation of the MC-Suite criterion. Moreover, we introduce the benefits of task-agnostic fine-tuning as a correction mechanism during iterative expert dropping, which we term MoE Lottery Subnetworks. Lastly, we present an experimentally validated conjecture that, during expert dropping, SMoEs' instruction-following capabilities are predominantly hurt, which can be restored to a robust level subject to external augmentation of instruction-following capabilities using k-shot examples and supervised fine-tuning.
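
The three dropping schedules named in the abstract (one-shot, iterative with re-estimation, and iterative with a fine-tuning correction) can be illustrated with a minimal sketch. This is not the authors' implementation: `prune_experts`, `score_fn`, and `finetune_fn` are hypothetical names, the score function is a stand-in for any MC-Suite criterion, and the sketch assumes that the lowest-scoring experts are the ones dropped and that every layer is pruned to the same sparsity.

```python
from typing import Callable, Dict, List, Optional

Kept = Dict[int, List[int]]                   # layer index -> expert ids still kept
ScoreFn = Callable[[int, int, Kept], float]   # (layer, expert, kept) -> importance


def prune_experts(
    num_layers: int,
    experts_per_layer: int,
    sparsity: float,
    rounds: int,
    score_fn: ScoreFn,
    finetune_fn: Optional[Callable[[Kept], None]] = None,
) -> Kept:
    """Drop the lowest-scoring experts per layer over `rounds` rounds.

    rounds == 1 and no finetune_fn -> one-shot pruning
    rounds == k and no finetune_fn -> iterative pruning with re-estimation
    rounds == k with a finetune_fn -> estimate-prune-finetune loop
    """
    kept: Kept = {l: list(range(experts_per_layer)) for l in range(num_layers)}
    total_drop = round(sparsity * experts_per_layer)

    for r in range(1, rounds + 1):
        # Cumulative number of experts that should be gone after round r.
        target = round(total_drop * r / rounds)
        for layer in range(num_layers):
            experts = kept[layer]
            n_drop = target - (experts_per_layer - len(experts))
            if n_drop <= 0:
                continue
            # Scores are re-estimated every round on the current subnetwork.
            scores = {e: score_fn(layer, e, kept) for e in experts}
            ranked = sorted(experts, key=lambda e: scores[e])
            dropped = set(ranked[:n_drop])
            kept[layer] = [e for e in experts if e not in dropped]
        if finetune_fn is not None:
            # Hypothetical hook: a correction step before the next estimate.
            finetune_fn(kept)
    return kept


def toy_score(layer: int, expert: int, kept: Kept) -> float:
    """Toy criterion whose ranking flips once any expert has been removed."""
    return float(expert) if len(kept[layer]) == 8 else -float(expert)
```

With the toy criterion, `prune_experts(1, 8, 0.5, 1, toy_score)` and `prune_experts(1, 8, 0.5, 2, toy_score)` keep different experts, because the second call re-scores the surviving experts after the first round. That is the only point the sketch is meant to make: re-estimation can change which experts look least important.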

Paper Structure

This paper contains 21 sections, 16 equations, 15 figures, 7 tables.

Figures (15)

  • Figure 1: MoE Experts Compression Suite (MC-Suite): A comprehensive basket of criteria ($c$) to investigate dominant experts across different SMoE blocks from the perspectives of weights, expert behavior, intermediate activations, and gradient behavior. Marked criteria have been previously explored, either in exactly the same formulation or with slight variation. Based on the score of a criterion (score$_{c}^{e}$) estimated within an MoE layer, an expert $(e)$ is identified and removed.
  • Figure 2: Wikitext Perplexity of Mixtral 8$\times$7B pretrained checkpoint when removing a single expert $e$ from layer $l$.
  • Figure 3: Overview of Different Expert Pruning Strategies: Given a target expert sparsity of $S\%$, (a) One-shot pruning: removes $S\%$ of experts from each layer $L$ of the MoE based on a one-time estimation of criterion $c$; (b) Iterative pruning: removes $S/k\%$ of experts before each re-estimation of criterion $c$, for $k$ rounds; (c) MoE Lottery pruning: removes $S/k\%$ of experts, followed by task-agnostic budget finetuning using calibration data, before each re-estimation of criterion $c$, for $k$ rounds.
  • Figure 4: Performance comparison (perplexity on C4) of Mixtral-8$\times$7B Base Lottery Subnetworks identified by dropping experts iteratively using various criteria from MC-Suite. The original Mixtral-8$\times$7B Base checkpoint achieves a perplexity of 7.44 on the C4 validation set. Min and Max denote the expert ($e$) with the minimum or maximum score of a criterion ($c$).
  • Figure 5: Dropped Experts Distribution with 50% Sparsity: (a) Difference in the experts identified to be dropped by one-shot pruning compared with MoE Lottery pruning; (b) Difference in the experts identified to be dropped by iterative pruning compared with MoE Lottery pruning. Light bisque for an expert (e$_{L}^i$) indicates that both pruning techniques agree to drop e$_{L}^i$, dark pink indicates disagreement about dropping it, and black indicates agreement to retain e$_{L}^i$.
  • ...and 10 more figures
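
The three-way coloring described in the Figure 5 caption amounts to comparing two sets of dropped experts per layer. The following is a minimal sketch of that comparison, not the authors' plotting code: `agreement_map` and the label strings are hypothetical, and the inputs are assumed to be dictionaries mapping a layer index to the set of expert ids each strategy dropped.

```python
from typing import Dict, List, Set


def agreement_map(
    dropped_a: Dict[int, Set[int]],
    dropped_b: Dict[int, Set[int]],
    experts_per_layer: int,
) -> Dict[int, List[str]]:
    """Label each expert by how two pruning strategies treat it.

    "agree_drop"   -> both strategies drop the expert
    "disagree"     -> exactly one strategy drops it
    "agree_retain" -> both strategies keep it
    """
    labels: Dict[int, List[str]] = {}
    for layer in sorted(set(dropped_a) | set(dropped_b)):
        a = dropped_a.get(layer, set())
        b = dropped_b.get(layer, set())
        row = []
        for expert in range(experts_per_layer):
            if expert in a and expert in b:
                row.append("agree_drop")
            elif expert in a or expert in b:
                row.append("disagree")
            else:
                row.append("agree_retain")
        labels[layer] = row
    return labels
```

Each returned row corresponds to one layer, so the result can be passed to any heatmap routine after mapping the three labels to colors.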