Cluster-Driven Expert Pruning for Mixture-of-Experts Large Language Models
Hongcheng Guo, Juntao Yao, Boyang Wang, Junjia Du, Shaosheng Cao, Donglin Di, Shun Zhang, Zhoujun Li
TL;DR
Mixture-of-Experts LLMs carry a large total parameter footprint despite sparse activation. The paper presents Cluster-driven Expert Pruning (C-Prune), a two-stage, similarity-aware framework that first performs layer-wise clustering to expose intra-layer redundancy and then global clustering to remove redundant cross-layer clusters, guided by a unified objective. C-Prune merges experts within clusters and adapts routing to preserve task performance, achieving 25-35% parameter reduction with competitive or superior accuracy compared to existing pruning methods, especially at low compression ratios. Across MoE variants and benchmarks, C-Prune reveals depth-dependent homogeneity and domain-specific pruning advantages, offering a practical path for efficient deployment of MoE LLMs in real-world settings.
Abstract
Mixture-of-Experts (MoE) architectures have emerged as a promising paradigm for scaling large language models (LLMs) through sparse activation of task-specific experts. Despite their computational efficiency during inference, the massive overall parameter footprint of MoE models (e.g., GPT-4) poses critical challenges for practical deployment. Current pruning approaches often fail to address two inherent characteristics of MoE systems: (1) intra-layer expert homogeneity, where experts within the same MoE layer exhibit functional redundancy, and (2) inter-layer similarity patterns, where deeper layers tend to contain progressively more homogeneous experts. To tackle these issues, we propose Cluster-driven Expert Pruning (C-Prune), a novel two-stage framework for adaptive task-specific compression of MoE LLMs. C-Prune first performs layer-wise expert clustering, which groups functionally similar experts within each MoE layer using parameter similarity metrics, and then performs global cluster pruning, which eliminates redundant clusters across all layers through a unified importance scoring mechanism that accounts for cross-layer homogeneity. We validate C-Prune through extensive experiments on multiple MoE models and benchmarks. The results demonstrate that C-Prune effectively reduces model size while outperforming existing MoE pruning methods.
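The two-stage idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the similarity metric, the clustering algorithm, the merge rule, or the importance score, so everything concrete here is an assumption made for illustration. The sketch assumes cosine similarity over flattened expert weights, a greedy threshold-based clustering, merging by simple averaging, and a hypothetical redundancy score (the highest similarity to a merged expert in another layer). The names `cluster_layer`, `merge`, and `prune_global` are likewise hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two flat weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_layer(experts, threshold):
    """Stage 1 (assumed greedy variant): group the experts of one layer.

    An expert joins the first cluster whose seed (first member) has cosine
    similarity >= threshold; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of expert indices.
    """
    clusters = []
    for i, w in enumerate(experts):
        for members in clusters:
            if cosine(w, experts[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def merge(experts, members):
    """Merge a cluster into one expert by averaging weights (assumption)."""
    n = len(members)
    dim = len(experts[members[0]])
    return [sum(experts[i][d] for i in members) / n for d in range(dim)]


def prune_global(layers, threshold, n_drop):
    """Stage 2 (assumed scoring): drop up to n_drop most redundant clusters.

    Redundancy of a cluster is the highest cosine similarity between its
    merged expert and any merged expert in a different layer. Every layer
    always keeps at least one cluster.
    """
    merged = []  # entries: (layer index, member indices, merged weights)
    for layer_idx, experts in enumerate(layers):
        for members in cluster_layer(experts, threshold):
            merged.append((layer_idx, members, merge(experts, members)))

    def redundancy(item):
        layer_idx, _, w = item
        sims = [cosine(w, w2) for l2, _, w2 in merged if l2 != layer_idx]
        return max(sims, default=0.0)

    remaining = {}
    for layer_idx, _, _ in merged:
        remaining[layer_idx] = remaining.get(layer_idx, 0) + 1

    order = sorted(range(len(merged)),
                   key=lambda i: redundancy(merged[i]), reverse=True)
    dropped = set()
    for idx in order:
        if len(dropped) == n_drop:
            break
        layer_idx = merged[idx][0]
        if remaining[layer_idx] > 1:
            dropped.add(idx)
            remaining[layer_idx] -= 1
    return [m for i, m in enumerate(merged) if i not in dropped]


# Toy usage: two layers of 2-dimensional "experts".
layers = [
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
    [[1.0, 0.05], [0.0, -1.0]],
]
kept = prune_global(layers, threshold=0.9, n_drop=1)
```

In this toy input, the first two experts of the first layer fall into one cluster, and one of the two nearly identical cross-layer clusters is removed. A real MoE model would additionally need its router adapted to the merged and pruned experts, which this sketch does not attempt.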
