Mosaic Pruning: A Hierarchical Framework for Generalizable Pruning of Mixture-of-Experts Models

Wentao Hu, Mingkuan Zhao, Shuangyong Song, Xiaoyan Zhu, Xin Lai, Jiayin Wang

TL;DR

MoE models incur large static memory overhead, and pruning them with a general corpus causes functional collapse on domain-specific tasks. Mosaic Pruning (MoP) introduces a hierarchical cluster-then-select pruning framework that (i) retains a core group of general experts, (ii) uses data-driven domain discovery to reveal latent functional domains, (iii) builds a domain-aware similarity matrix and clusters experts with Ward linkage, and (iv) selects cluster representatives via the Activation Variability Score $S_{\text{var}}$. Key contributions include the Activation Variability Score, a data-driven domain discovery and clustering strategy, and strong empirical gains: an average $7.24\%$ improvement on general tasks and $8.92\%$ on specialized tasks, with deployment efficiency gains. This approach enables prune-once deployment of MoE models across diverse downstream tasks, improving practical applicability and robustness to domain shifts.

Abstract

Sparse Mixture-of-Experts (SMoE) architectures have enabled a new frontier in scaling Large Language Models (LLMs), offering superior performance by activating only a fraction of their total parameters during inference. However, their practical deployment is severely hampered by substantial static memory overhead, as all experts must be loaded into memory. Existing post-training pruning methods, while reducing model size, often derive their pruning criteria from a single, general-purpose corpus. This leads to a critical limitation: catastrophic performance degradation when the pruned model is applied to other domains, necessitating costly re-pruning for each new domain. To address this generalization gap, we introduce Mosaic Pruning (MoP). The core idea of MoP is to construct a functionally comprehensive set of experts through a structured "cluster-then-select" process. This process uses a similarity metric that captures expert performance across different task domains to cluster the experts by function, and then selects the most representative expert from each cluster based on our proposed Activation Variability Score. Unlike methods that optimize for a single corpus, Mosaic Pruning ensures that the pruned model retains a functionally complementary set of experts, much like the tiles of a mosaic that together form a complete picture of the original model's capabilities, enabling it to handle diverse downstream tasks. Extensive experiments on various MoE models demonstrate the superiority of our approach. MoP significantly outperforms prior work, achieving a 7.24% gain on general tasks and 8.92% on specialized tasks like math reasoning and code generation.

Paper Structure

This paper contains 27 sections, 13 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: A conceptual illustration of the expert pruning strategies. (a) Unpruned MoE Layer: The original experts, with colors indicating different functional specializations. (b) Enu: Retains a functionally homogeneous set of experts (E1, E2, E3) by minimizing reconstruction loss. (c) GVP: Supplements core experts with globally selected specialists (E4, E5), improving diversity but risking functional overlap. (d) MoP: Clusters experts by functional similarity and selects a representative from each cluster, ensuring a final expert set that is both specialized and complementary.
  • Figure 2: The workflow of the Mosaic Pruning (MoP) framework. First, the calibration data is partitioned into distinct functional domains. Subsequently, a similarity matrix between experts is constructed based on their performance profiles across these domains. This matrix is used to cluster experts with high similarity into the same group. Finally, experts are selected within each cluster based on their Activation Variability Score.
  • Figure 3: Heatmaps of expert activation weights across different domains for Mixtral-8x7B pruned to 4 experts. (Top row): Enumeration Pruning leads to functional homogenization, with a few generalist experts dominating all tasks. (Bottom row): Our MoP method preserves domain specialization, with different experts activating for different tasks.
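
The cluster-then-select workflow described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: each expert is represented by a hypothetical per-domain performance profile, clustering is done directly on those profiles with Ward's minimum-variance criterion instead of through an explicit similarity matrix, and the Activation Variability Score is passed in as precomputed numbers because its formula is not given in this summary. The sketch also assumes that the highest score wins within a cluster. All function names, profiles, and scores below are made up for illustration.

```python
from itertools import combinations


def centroid(rows):
    """Mean vector of a list of equal-length profiles."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def ward_cost(a, b, profiles):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca = centroid([profiles[i] for i in a])
    cb = centroid([profiles[i] for i in b])
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def ward_clusters(profiles, k):
    """Agglomerative clustering with Ward's criterion, stopped at k clusters."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > k:
        # Merge the pair whose union increases within-cluster variance least.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: ward_cost(pair[0], pair[1], profiles))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(sorted(a + b))
    return sorted(clusters)


def select_representatives(clusters, scores):
    """Keep one expert per cluster: the highest score (assumed direction)."""
    return sorted(max(c, key=lambda i: scores[i]) for c in clusters)


# Hypothetical per-expert profiles over three discovered domains.
profiles = [
    [0.9, 0.1, 0.1],  # expert 0
    [0.8, 0.2, 0.1],  # expert 1
    [0.1, 0.9, 0.2],  # expert 2
    [0.2, 0.8, 0.1],  # expert 3
    [0.1, 0.1, 0.9],  # expert 4
    [0.2, 0.1, 0.8],  # expert 5
]
# Placeholder values standing in for the Activation Variability Score.
scores = [0.30, 0.10, 0.05, 0.40, 0.20, 0.25]

clusters = ward_clusters(profiles, k=3)
kept = select_representatives(clusters, scores)
print(clusters)  # [[0, 1], [2, 3], [4, 5]]
print(kept)      # [0, 3, 5]
```

With these toy inputs, experts that behave alike on the same domain end up in the same cluster, and one expert per cluster survives. The retained set therefore covers all three domains, which is the complementarity property the captions attribute to MoP.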