Retraining-Free Merging of Sparse MoE via Hierarchical Clustering

I-Chun Chen, Hsu-Shen Liu, Wei-Fang Sun, Chen-Hao Chao, Yen-Chang Hsu, Chun-Yi Lee

TL;DR

This work tackles the memory bottleneck of sparse Mixture-of-Experts (SMoE) models by introducing HC-SMoE, a retraining-free merging framework that applies hierarchical clustering to expert outputs to identify functionally similar experts. By clustering with average linkage and merging each cluster through frequency-weighted averaging, HC-SMoE preserves model behavior while reducing the number of experts that must be stored. The approach achieves strong zero-shot performance on LLMs such as Qwen and Mixtral, often surpassing retraining-free baselines and approaching, or even matching, the original models under significant expert reductions. Because expert similarity is measured from outputs on calibration data, the method is reported to be robust across datasets and tasks, enabling practical deployment of compressed SMoE models without retraining.

Abstract

Sparse Mixture-of-Experts (SMoE) models represent a significant advancement in large language model (LLM) development through their efficient parameter utilization. These models achieve substantial performance improvements at reduced inference costs. However, the deployment of SMoE models faces constraints from extensive memory requirements of expert components in resource-limited environments. To address these limitations, this paper introduces Hierarchical Clustering for Sparsely activated Mixture of Experts (HC-SMoE), a task-agnostic expert merging framework for parameter reduction without retraining. HC-SMoE introduces a novel hierarchical clustering approach based on expert outputs to ensure merging robustness independent of routing decisions. The proposed output-based clustering method enables effective capture of functional relationships between experts for large-scale architectures. We provide theoretical analysis and comprehensive evaluations across multiple zero-shot language tasks to demonstrate HC-SMoE's effectiveness in state-of-the-art models including Qwen and Mixtral. The experimental results validate HC-SMoE's superior performance and practical applicability for real-world deployments.
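
The clustering-then-merging idea summarized above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation. It assumes that each expert is summarized by its mean output vector over a calibration set, that distances between those vectors are Euclidean, and that each expert's parameters can be treated as one flat vector. The function names `average_linkage_clusters` and `merge_by_frequency`, and all toy values, are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage_clusters(points, num_clusters):
    """Agglomerative clustering with average linkage.

    points: one vector per expert (assumed: mean expert output on calibration data).
    Returns a list of clusters, each a sorted list of expert indices.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average linkage: mean pairwise distance between members.
                total = sum(
                    euclidean(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                dist = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return sorted(sorted(c) for c in clusters)


def merge_by_frequency(weights, freqs, cluster):
    """Frequency-weighted average of the flat weight vectors in one cluster."""
    total = sum(freqs[i] for i in cluster)
    dim = len(weights[cluster[0]])
    return [sum(freqs[i] * weights[i][k] for i in cluster) / total for k in range(dim)]


# Toy data (hypothetical): 4 experts, 2-dimensional mean outputs.
mean_outputs = [[1.0, 0.0], [1.1, 0.1], [-1.0, 0.5], [-0.9, 0.4]]
weights = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [4.0, 8.0]]  # flat expert parameters
freqs = [30, 10, 5, 15]  # how often the router selected each expert

clusters = average_linkage_clusters(mean_outputs, num_clusters=2)
merged = [merge_by_frequency(weights, freqs, c) for c in clusters]
print(clusters)  # [[0, 1], [2, 3]]
print(merged)    # [[1.5, 2.5], [3.0, 6.0]]
```

In this toy run, experts 0 and 1 end up in one group and experts 2 and 3 in the other, because the grouping depends only on how close their outputs are. The routing frequencies enter only at the merging step, where the more frequently selected expert contributes more to the merged weights.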

Paper Structure

This paper contains 32 sections, 13 equations, 13 figures, 23 tables, 1 algorithm.

Figures (13)

  • Figure 1: Effectiveness of expert parameter reduction approaches on Qwen1.5-MoE-A2.7B-Chat. Average accuracy across 8 LM-Harness benchmarks demonstrates HC-SMoE's superior performance over existing retraining-free pruning and merging baselines at 25%, 37.5%, and 50% expert parameter reduction rates. $\star$ indicates the performance of the original unpruned Qwen model.
  • Figure 2: Illustration of the proposed hierarchical clustering strategy based on expert outputs. Each blue circle denotes the outputs of an expert in the embedding space. Hierarchical clustering iteratively merges the pair of expert clusters with the minimum inter-cluster distance.
  • Figure 3: Comparison of expert pruning and merging strategies.
  • Figure 4: Fix-dominant merging. Given the experts within a cluster and the index of the dominant expert: Step 1, collect intermediate features from each expert. Step 2, compute pairwise correlations to measure the similarity between the dominant expert and each non-dominant expert. Step 3, assign each dimension of a non-dominant expert to the group of the dominant-expert dimension with which it is most similar. Step 4, based on this grouping, merge the expert weights by averaging within each group.
  • Figure 5: General architecture of SMoE. The router uses top-2 routing to assign each token to the two experts with the highest scores.
  • ...and 8 more figures
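
The four steps in the Figure 4 caption can be read as a feature-alignment procedure, and the sketch below is one possible interpretation of them, not the authors' code. It assumes Pearson correlation as the pairwise similarity and assumes that each intermediate dimension corresponds to one row of a single weight matrix. A real expert has several weight matrices whose rows or columns would need the same grouping. The function names and toy values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def column(features, j):
    """Activations of intermediate dimension j across all tokens."""
    return [row[j] for row in features]


def fix_dominant_merge(dom_feats, dom_weight, other_feats, other_weights):
    """Align non-dominant dimensions to the dominant expert, then average by group.

    dom_feats:     tokens x d_ff intermediate features of the dominant expert (Step 1).
    dom_weight:    d_ff weight rows of the dominant expert.
    other_feats:   list of feature matrices, one per non-dominant expert.
    other_weights: list of weight matrices, one per non-dominant expert.
    """
    d_ff = len(dom_weight)
    # Each group is seeded with the dominant expert's own row.
    groups = [[dom_weight[i]] for i in range(d_ff)]
    for feats, weight in zip(other_feats, other_weights):
        for j in range(len(weight)):
            col_j = column(feats, j)
            # Steps 2-3: pick the dominant dimension with the highest correlation.
            best = max(range(d_ff), key=lambda i: pearson(column(dom_feats, i), col_j))
            groups[best].append(weight[j])
    # Step 4: average the weight rows inside each group.
    return [
        [sum(r[k] for r in rows) / len(rows) for k in range(len(rows[0]))]
        for rows in groups
    ]


# Toy data (hypothetical): the second expert's dimensions are a permutation
# of the dominant expert's dimensions, up to small noise.
dom_feats = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]]
other_feats = [[[0.0, 1.1], [1.0, 2.0], [0.0, 3.2], [1.0, 3.9]]]
dom_weight = [[1.0, 1.0], [5.0, 5.0]]
other_weights = [[[7.0, 7.0], [3.0, 3.0]]]

merged = fix_dominant_merge(dom_feats, dom_weight, other_feats, other_weights)
print(merged)  # [[2.0, 2.0], [6.0, 6.0]]
```

In this toy case a plain element-wise average would give `[[4.0, 4.0], [4.0, 4.0]]` and erase the difference between the two dimensions. Grouping by correlation first pairs each non-dominant row with the dominant row it behaves like, so the merged rows stay distinct.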