MoE Pathfinder: Trajectory-driven Expert Pruning

Xican Yang, Yuanhe Tian, Yan Song

TL;DR

This work tackles the deployment and efficiency bottlenecks of Mixture-of-Experts models by introducing a trajectory-driven pruning framework. It reformulates MoE as a layered, weighted graph and performs global path planning using transition intensities, expert importance, and reconstruction signals to prune along top information-propagating trajectories. The method yields non-uniform, cross-layer pruning and demonstrates superior pruning performance across six benchmarks and two Mixtral models, while preserving core knowledge and routing logic. The approach offers a practical, interpretable MoE compression strategy with potential for scalable deployment in large language models.

Abstract

Mixture-of-experts (MoE) architectures used in large language models (LLMs) achieve state-of-the-art performance across diverse tasks yet face practical challenges such as deployment complexity and low activation efficiency. Expert pruning has thus emerged as a promising solution to reduce computational overhead and simplify the deployment of MoE models. However, existing expert pruning approaches conventionally rely on local importance metrics and often apply uniform layer-wise pruning, leveraging only partial evaluation signals and overlooking the heterogeneous contributions of experts across layers. To address these limitations, we propose an expert pruning approach based on the trajectory of activated experts across layers, which treats MoE as a weighted computation graph and casts expert selection as a global optimal path planning problem. Within this framework, we integrate complementary importance signals from reconstruction error, routing probabilities, and activation strength at the trajectory level, which naturally yields non-uniform expert retention across layers. Experiments show that our approach achieves superior pruning performance on nearly all tasks compared with most existing approaches.
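The graph view described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a path picks one expert per layer, that a path's score is the plain sum of the node and edge weights it touches, and that the names `top_k_paths`, `experts_to_keep`, `node_w`, and `edge_w` are hypothetical. Under those assumptions, an exact k-best dynamic program finds the top-ranked paths, and the retained experts are the ones those paths visit.

```python
import heapq


def top_k_paths(node_w, edge_w, k):
    """Return the k highest-scoring paths through a layered graph.

    node_w[l][e]    : importance of expert e in layer l (hypothetical score)
    edge_w[l][i][j] : transition intensity from expert i in layer l to
                      expert j in layer l + 1 (hypothetical score)
    Assumption: a path's score is the sum of its node and edge weights.
    """
    # best[e] holds up to k (score, path) pairs for partial paths ending at e.
    best = [[(w, (e,))] for e, w in enumerate(node_w[0])]
    for layer in range(1, len(node_w)):
        nxt = []
        for j, wj in enumerate(node_w[layer]):
            candidates = [
                (score + edge_w[layer - 1][i][j] + wj, path + (j,))
                for i, partials in enumerate(best)
                for score, path in partials
            ]
            # Keeping the k best partial paths per node is enough to
            # recover the k best full paths when scores are additive.
            nxt.append(heapq.nlargest(k, candidates))
        best = nxt
    return heapq.nlargest(k, (p for partials in best for p in partials))


def experts_to_keep(paths, num_layers):
    """Collect, per layer, the experts that appear on any selected path."""
    keep = [set() for _ in range(num_layers)]
    for _, path in paths:
        for layer, expert in enumerate(path):
            keep[layer].add(expert)
    return keep


# Toy example: 3 layers with 2 experts each (made-up weights).
node_w = [[1.0, 0.5], [0.2, 0.9], [0.7, 0.1]]
edge_w = [
    [[0.1, 0.8], [0.3, 0.2]],  # layer 0 -> layer 1
    [[0.5, 0.1], [0.6, 0.4]],  # layer 1 -> layer 2
]
paths = top_k_paths(node_w, edge_w, k=2)
keep = experts_to_keep(paths, num_layers=3)
print(keep)  # [{0}, {1}, {0, 1}]
```

In this toy case the two best paths share their first two experts and differ only in the last layer, so the final layer keeps both experts while the earlier layers keep one each. This is the kind of non-uniform, cross-layer retention the abstract refers to, although the paper's actual scoring and search procedure may differ.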

Paper Structure

This paper contains 18 sections, 13 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Overview of the proposed expert pruning framework based on the trajectory of activated experts across layers. The top part presents the standard multi-layer MoE architecture, where tokens are dynamically routed to sparse experts. The bottom part shows our pruning approach, which reformulates the MoE as a directed weighted graph. We first compute transition intensities (edge weights) and expert importance scores (node weights) from routing probabilities, expert activations, and reconstruction loss. We then use a global path planning algorithm to identify the top-ranked optimal inference trajectories. Experts located on these critical paths are retained (highlighted in purple), while the remaining redundant experts are pruned.
  • Figure 2: Visualization of expert importance for the Mixtral-8x7B model. These heatmaps illustrate layer-wise and expert-specific importance, quantified by how often an expert is selected during the path planning phase. Calibration data are partitioned into $K=10$ clusters, and expert frequencies are computed by sampling the top-100 paths for each cluster. A higher selection frequency (indicated by a darker color) signifies greater importance. The results include a general-domain task, MMLU (top), and two domain-specific tasks, MedQA (middle) and GSM8K (bottom).
  • Figure 3: Effect of the number of clusters ($K$) on the performance of the Mixtral-8x7B model pruned to 50% expert sparsity. The horizontal axis denotes the number of k-means clusters ($K$) used to construct calibration data, and the vertical axis shows the model's accuracy on WinoGrande (blue) and ARC (red).
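
The selection frequency described in the Figure 2 caption can be sketched as a simple count. This is an illustrative reading of the caption, not the authors' code: it assumes each sampled path is a tuple with one expert index per layer, that paths from all clusters are pooled with equal weight, and that the function name `selection_frequency` and its arguments are hypothetical.

```python
def selection_frequency(paths_per_cluster, num_layers, num_experts):
    """Fraction of sampled paths that pass through each (layer, expert).

    paths_per_cluster : one list of paths per calibration cluster; each
                        path is a tuple with one expert index per layer
                        (assumed format).
    Returns a num_layers x num_experts nested list of frequencies.
    """
    counts = [[0] * num_experts for _ in range(num_layers)]
    total = 0
    for paths in paths_per_cluster:
        for path in paths:
            total += 1
            for layer, expert in enumerate(path):
                counts[layer][expert] += 1
    if total == 0:
        return counts  # no paths sampled: all frequencies stay at zero
    return [[c / total for c in row] for row in counts]


# Toy example: 2 clusters, 2 layers, 2 experts per layer (made-up paths).
freq = selection_frequency(
    [[(0, 1), (0, 0)], [(1, 1)]], num_layers=2, num_experts=2
)
```

Each row of `freq` corresponds to one layer of the heatmap, and a larger entry would map to a darker cell. With the caption's settings, `paths_per_cluster` would hold 10 lists of 100 paths each.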