Breaking the MoE LLM Trilemma: Dynamic Expert Clustering with Structured Compression

Peijun Zhu, Ning Yang, Jiayu Wei, Jinghang Wu, Haijun Zhang

TL;DR

This work tackles the MoE LLM optimization trilemma (load imbalance, parameter redundancy, and communication overhead) by introducing online dual-similarity clustering to dynamically group experts and a shared-base plus low-rank residual parameterization within each group. A two-stage hierarchical routing strategy reduces routing complexity and inter-device communication, while heterogeneous precision and dynamic offloading keep peak memory comparable to dense models. Empirical results on GLUE and WikiText-103 show the approach matches standard MoE quality while reducing total parameters by approximately 80%, increasing throughput by 10% to 20%, and lowering expert load variance by more than a factor of 3, with memory benefits amplified by offloading and quantization. Overall, the paper demonstrates that principled structural reorganization and co-design can yield scalable, efficient, and memory-conscious MoE LLMs.

Abstract

Mixture-of-Experts (MoE) Large Language Models (LLMs) face a trilemma of load imbalance, parameter redundancy, and communication overhead. We introduce a unified framework based on dynamic expert clustering and structured compression to address these issues cohesively. Our method employs an online clustering procedure that periodically regroups experts using a fused metric of parameter and activation similarity, which stabilizes expert utilization. To our knowledge, this is one of the first frameworks to leverage the semantic embedding capability of the router to dynamically reconfigure the model's architecture during training for substantial efficiency gains. Within each cluster, we decompose expert weights into a shared base matrix and extremely low-rank residual adapters, achieving up to fivefold parameter reduction per group while preserving specialization. This structure enables a two-stage hierarchical routing strategy: tokens are first assigned to a cluster, then to specific experts within it, drastically reducing the routing search space and the volume of all-to-all communication. Furthermore, a heterogeneous precision scheme, which stores shared bases in FP16 and residual factors in INT4, coupled with dynamic offloading of inactive clusters, reduces peak memory consumption to levels comparable to dense models. Evaluated on GLUE and WikiText-103, our framework matches the quality of standard MoE models while reducing total parameters by approximately 80%, improving throughput by 10% to 20%, and lowering expert load variance by a factor of over three. Our work demonstrates that structural reorganization is a principled path toward scalable, efficient, and memory-effective MoE LLMs.
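
The parameter saving from the shared-base plus low-rank residual decomposition can be illustrated with a small counting sketch. It assumes each expert in a group is a single d_in x d_out matrix replaced by one group-wide base plus two rank-r factors per expert; the function names and the example sizes are hypothetical and are not taken from the paper.

```python
def dense_group_params(num_experts: int, d_in: int, d_out: int) -> int:
    """Parameters when every expert in the group stores its own full matrix."""
    return num_experts * d_in * d_out


def compressed_group_params(num_experts: int, d_in: int, d_out: int, rank: int) -> int:
    """Parameters with one shared base and a low-rank residual per expert.

    Assumption: the residual of each expert is factored as A @ B with
    A of shape (d_in, rank) and B of shape (rank, d_out).
    """
    shared_base = d_in * d_out
    residual_per_expert = rank * (d_in + d_out)
    return shared_base + num_experts * residual_per_expert


def reduction_factor(num_experts: int, d_in: int, d_out: int, rank: int) -> float:
    """Ratio of dense to compressed parameter counts for one group."""
    dense = dense_group_params(num_experts, d_in, d_out)
    compressed = compressed_group_params(num_experts, d_in, d_out, rank)
    return dense / compressed


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    print(reduction_factor(num_experts=4, d_in=1024, d_out=4096, rank=16))
```

Under this counting, the per-group reduction factor is always below the number of experts in the group and approaches it as the residual rank shrinks, so larger groups and smaller ranks give larger savings.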

Paper Structure

This paper contains 23 sections, 11 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Overview of the online dual-similarity clustering and intra-group structured compression. Experts are dynamically clustered based on a fused similarity metric. Within each group, experts are compressed into a shared base matrix and low-rank residual adapters.
  • Figure 2: For each token, its representation is first used to compute affinities with all group prototypes, selecting the top group(s). Subsequently, the same token representation is compared only to the experts within the selected group(s) for fine-grained expert assignment.
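
The two-stage routing described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, assuming that affinity is a plain dot product and that each group and each expert is represented by one vector; the names `route_token`, `group_prototypes`, `expert_keys`, and `group_members` are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product, used here as an assumed affinity score."""
    return sum(a * b for a, b in zip(u, v))


def top_k(scores: Dict[int, float], k: int) -> List[int]:
    """Return the k keys with the highest scores, best first."""
    return sorted(scores, key=lambda i: scores[i], reverse=True)[:k]


def route_token(
    token: Vector,
    group_prototypes: Dict[int, Vector],
    expert_keys: Dict[int, Vector],
    group_members: Dict[int, List[int]],
    top_groups: int = 1,
    top_experts: int = 1,
) -> Tuple[List[int], List[int]]:
    # Stage 1: score the token against every group prototype.
    group_scores = {g: dot(token, p) for g, p in group_prototypes.items()}
    chosen_groups = top_k(group_scores, top_groups)

    # Stage 2: score the same token only against experts in the chosen groups.
    candidate_scores = {
        e: dot(token, expert_keys[e])
        for g in chosen_groups
        for e in group_members[g]
    }
    return chosen_groups, top_k(candidate_scores, top_experts)


if __name__ == "__main__":
    prototypes = {0: [1.0, 0.0], 1: [0.0, 1.0]}
    members = {0: [0, 1], 1: [2, 3]}
    keys = {0: [0.5, 0.0], 1: [0.9, 0.1], 2: [5.0, 0.0], 3: [0.0, 1.0]}
    print(route_token([1.0, 0.0], prototypes, keys, members))
```

In the usage example, expert 2 would score highest in a flat search over all experts, but it is never evaluated because its group is not selected in the first stage. This is the sense in which the search space shrinks: stage two touches only the members of the chosen group(s).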