
EvoESAP: Non-Uniform Expert Pruning for Sparse MoE

Zongfang Liu, Shengkun Tang, Boyang Sun, Zhiqiang Shen, Xin Yuan

TL;DR

EvoESAP is an evolutionary search framework that optimizes a non-uniform layer-wise sparsity allocation under a fixed global budget while holding the within-layer pruning order fixed. This makes it plug-and-play with existing pruning criteria such as Frequency, EAN, SEER, and REAP.

Abstract

Sparse Mixture-of-Experts (SMoE) language models achieve strong capability at low per-token compute, yet deployment remains memory- and throughput-bound because the full expert pool must be stored and served. Post-training expert pruning reduces this cost, but most methods focus on which experts to prune within each layer and default to a uniform layer-wise sparsity allocation, even though the allocation can strongly affect performance. We decouple pruning into within-layer expert ranking and across-layer budget allocation, and introduce the Expected Speculative Acceptance Proxy (ESAP), a speculative-decoding-inspired, teacher-forced metric that measures how well a pruned model matches the full model. ESAP is bounded and stable, enabling cheap comparison of many candidates without costly autoregressive decoding. Building on ESAP, we propose EvoESAP, an evolutionary search framework that optimizes a non-uniform layer-wise sparsity allocation under a fixed global budget while holding the within-layer pruning order fixed, making it plug-and-play with criteria such as Frequency, EAN, SEER, and REAP. Across 7B–30B SMoE LLMs at 25% and 50% sparsity, EvoESAP consistently discovers non-uniform allocations that improve open-ended generation (up to +19.6% on MATH-500 at 50% sparsity) while preserving competitive multiple-choice accuracy compared with uniform pruning at the same sparsity.
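
To make the ESAP idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the "overlap" between the full model's next-token distribution $p$ and the pruned model's distribution $q$ is $\sum_v \min(p(v), q(v))$, which is the expected acceptance probability in standard speculative sampling; the exact definition is given in the paper and is not reproduced on this page. The function names (`softmax`, `overlap`, `esap_sample`) and the toy logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def overlap(p, q):
    """Assumed overlap: sum_v min(p(v), q(v)), which lies in [0, 1]."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))

def esap_sample(target_logits, draft_logits):
    """Average overlap over answer-token positions of one sample.

    Both arguments are lists of per-position logit lists obtained under
    teacher forcing (same input tokens fed to both models).
    """
    if not target_logits or len(target_logits) != len(draft_logits):
        raise ValueError("need the same non-zero number of positions")
    scores = [overlap(softmax(t), softmax(d))
              for t, d in zip(target_logits, draft_logits)]
    return sum(scores) / len(scores)

# Toy example: two answer positions, vocabulary of size 3 (made-up numbers).
full_model = [[2.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
pruned_model = [[1.5, 0.3, -1.0], [0.9, 0.4, 0.2]]

same_score = esap_sample(full_model, full_model)      # identical models
pruned_score = esap_sample(full_model, pruned_model)  # some mismatch
```

Because every per-position term lies in $[0, 1]$, the averaged score is bounded as well, and it only needs one teacher-forced forward pass per model rather than autoregressive decoding.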

Paper Structure (22 sections, 22 equations, 3 figures, 9 tables, 1 algorithm)


Figures (3)

  • Figure 1: Layer-wise density schedules and performance for OLMoE-1B-7B-0125-Instruct at 25% global sparsity (density = $1-\text{sparsity}$). Each panel shows the per-layer remaining expert density under a fixed global pruning budget, using REAP to rank experts in each layer (computed from 1,024 calibration samples from evol-codealpaca-v1). Uniform prunes the same fraction of experts in every layer. Frequency-based ranks experts globally by routing frequency, counts how many experts from each layer fall in the bottom 25%, and uses those counts to set the layer-wise sparsity. Searched finds a non-uniform sparsity schedule with EvoESAP under the same budget. Numbers above the panels report average performance on Code/Math/MC benchmarks, with deltas relative to Uniform. The results suggest that, for a fixed pruning metric, non-uniform allocation can better preserve the model's capabilities in SMoE expert pruning; however, finding an effective non-uniform allocation is non-trivial, and a poor allocation can harm overall performance.
  • Figure 2: Overview of EvoESAP. (a) Evolutionary search with budget-preserving level-switch mutation. Histograms visualize the layer-wise density distribution of each candidate model (density $=1-\text{sparsity}$) induced by an allocation $\mathbf{r}$ (experts removed per layer) under a fixed global budget $B$. Offspring are generated from the top $m$ survivors by a level-switch that transfers $\Delta$ units of pruning budget between two layers (gray: decrease; yellow: increase), keeping $\sum_{\ell} r_{\ell}=B$ unchanged. (b) ESAP as per-sample fitness. Under teacher forcing, ESAP scores a sample by the full-vocabulary overlap between the baseline/target next-token distribution (dark gray, $p(\cdot\mid x)$) and the candidate/draft distribution (blue, $q(\cdot\mid x)$), averaged over answer-token positions (higher is better; see Eq. \ref{eq:esap_def}).
  • Figure 3: Layer-wise density distributions (density $=1-\text{sparsity}$) of the searched non-uniform allocations across different pruning metrics.
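
The search loop described in the Figure 2 caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's algorithm: the population sizes, the `min_keep` constraint (at least one expert kept per layer), the elitist selection scheme, and the toy fitness function are all hypothetical. In the actual method, fitness would be the ESAP score of the model pruned according to $\mathbf{r}$.

```python
import random

def level_switch(r, num_experts, delta=1, rng=random, min_keep=1):
    """Move `delta` units of pruning budget from one layer to another.

    r[l] is the number of experts removed in layer l. The sum of r is
    unchanged, so the global budget B is preserved by construction.
    """
    donors = [i for i, x in enumerate(r) if x >= delta]
    receivers = [i for i, x in enumerate(r)
                 if x + delta <= num_experts - min_keep]
    pairs = [(d, c) for d in donors for c in receivers if d != c]
    if not pairs:
        return list(r)  # no valid move; return an unchanged copy
    d, c = rng.choice(pairs)
    child = list(r)
    child[d] -= delta   # this layer is pruned less
    child[c] += delta   # this layer is pruned more
    return child

def evolve(init, fitness, num_experts, generations=30,
           offspring=8, survivors=2, seed=0):
    """Elitist evolutionary search over allocations with a fixed budget."""
    rng = random.Random(seed)
    population = [list(init)]
    for _ in range(generations):
        # Keep the top-m candidates, then mutate them to create offspring.
        parents = sorted(population, key=fitness, reverse=True)[:survivors]
        children = [level_switch(rng.choice(parents), num_experts, rng=rng)
                    for _ in range(offspring)]
        population = parents + children
    return max(population, key=fitness)

# Toy setup: 6 layers, 16 experts each, 25% sparsity -> budget B = 24.
num_experts = 16
uniform = [4] * 6

def toy_fitness(r):
    """Hypothetical stand-in for ESAP: prefers pruning deeper layers more."""
    preferred = [2, 2, 4, 4, 6, 6]
    return -sum((x - t) ** 2 for x, t in zip(r, preferred))

best = evolve(uniform, toy_fitness, num_experts)
```

Because parents are carried over into the next population, the best fitness never decreases across generations, and every candidate satisfies $\sum_{\ell} r_{\ell} = B$ without any repair step.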