GMoE: Empowering LLMs Fine-Tuning via MoE Graph Collaboration

Ting Bai, Yue Yu, Le Huang, Zenan Xu, Chuan Shi

TL;DR

This work tackles instability and load imbalance in sparse Mixture-of-Experts (MoE) models during large language model (LLM) fine-tuning by introducing GMoE, a graph-based MoE whose graph router enables explicit collaboration among experts. It adds two coordination strategies: a Poisson distribution-based distinction strategy that promotes expert specialization, and a Normal distribution-based balance strategy that regulates workload. Both are implemented within a parameter-efficient fine-tuning framework using LoRA. Empirical results on four real-world benchmarks across multiple base LLMs show that GMoE achieves state-of-the-art accuracy with improved stability (lower standard deviation) while using fewer trainable parameters, which the authors attribute to the graph-based routing and efficient LoRA updates. The framework offers a scalable alternative to conventional router designs in which experts communicate with one another, with practical implications for stable, efficient fine-tuning of LLMs.

Abstract

The sparse Mixture-of-Experts (MoE) architecture of large language models (LLMs) suffers from an inherent load imbalance caused by the simplistic linear router strategy, which ultimately leads to unstable and inefficient learning in LLMs. To address this challenge, we introduce a novel graph-based MoE framework, GMoE, aimed at enhancing the collaboration among multiple experts. In GMoE, a graph router function is designed to capture the collaboration signals among experts. This enables all experts to dynamically allocate information derived from input data by sharing information with their neighboring experts. Moreover, we put forward two coordination strategies in GMoE, the Poisson distribution-based distinction strategy and the Normal distribution-based balance strategy, to further unlock the capacity of each expert and increase model stability during the fine-tuning of LLMs. Specifically, we leverage a parameter-efficient fine-tuning technique, Low-Rank Adaptation (LoRA), to implement the graph MoE architecture. Extensive experiments on four real-world benchmark datasets demonstrate the effectiveness of GMoE, showing the benefits of facilitating collaboration among multiple experts in LLM fine-tuning. The code of the experimental implementation is available at https://github.com/BAI-LAB/GMoE
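
To make the contrast between a linear router and a graph router concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical ring-shaped expert graph and a single round of mean aggregation over each expert's neighbors; the names `linear_router`, `graph_router`, `top_k_weights`, and `adj` are illustrative and do not come from the paper or its repository.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def linear_router(x, W):
    """Conventional router: one logit per expert from a linear map of the token."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def graph_router(x, W, adj):
    """Hypothetical graph router: each expert averages its own logit with the
    logits of its neighbors in the expert graph (one message-passing round)."""
    logits = linear_router(x, W)
    mixed = []
    for i, row in enumerate(adj):
        group = [i] + [j for j, edge in enumerate(row) if edge and j != i]
        mixed.append(sum(logits[j] for j in group) / len(group))
    return mixed


def top_k_weights(logits, k):
    """Keep the k most probable experts and renormalize their weights."""
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


random.seed(0)
num_experts, dim = 4, 8
# Assumed router weights and token embedding, drawn at random for illustration.
W = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
x = [random.gauss(0, 1) for _ in range(dim)]
# Assumed ring topology: expert i is connected to experts i-1 and i+1.
adj = [[1 if abs(i - j) in (1, num_experts - 1) else 0
        for j in range(num_experts)] for i in range(num_experts)]

linear_weights = top_k_weights(linear_router(x, W), k=2)
graph_weights = top_k_weights(graph_router(x, W, adj), k=2)
print("linear router:", linear_weights)
print("graph router: ", graph_weights)
```

With an empty graph the sketch reduces to the linear router, which shows that neighbor sharing is the only thing the graph adds here. In GMoE itself the expert graph and the router are trained jointly with the LoRA experts, which this sketch does not attempt to reproduce.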

Paper Structure

This paper contains 32 sections, 9 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: The overview of MoE architectures in the FFN layer. (a) The LoRA component is applied in the FFN layer of the transformer block. (b) The typical MoE architecture with LoRA in LLMs with a linear router function to assign weights. (c) Our proposed GMoE architecture with a graph router based on the MoE graph. For different input information (input1 and input2), the distinctive capability of experts is optimized by the Poisson distinction loss. For all input information, the activated frequency of each expert is balanced by the normal distribution loss.
  • Figure 2: Model performance of the degraded (ablated) variants of GMoE on Qwen2-7B.
  • Figure 3: Hyper-parameter analysis on the ARC-Challenge and BoolQ datasets, based on Qwen2-7B.
  • Figure 4: Efficiency comparisons across MoE methods, with trainable parameters and throughput (samples/second) on the x-axis and average accuracy across the four datasets on the y-axis.
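
The Figure 1 caption refers to the activation frequency of each expert, which is the quantity the balance strategy regulates. As a rough diagnostic sketch, the snippet below counts how often each expert appears in the top-k assignments and summarizes the imbalance with a coefficient of variation. This is an assumed measurement for illustration only, not the paper's Normal distribution-based loss; the function names and the toy assignments are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def activation_frequency(assignments, num_experts):
    """Fraction of all expert activations that go to each expert.

    `assignments` holds one tuple of selected expert indices per token.
    """
    counts = Counter(e for token in assignments for e in token)
    total = sum(counts.values())
    return [counts.get(i, 0) / total for i in range(num_experts)]


def imbalance(freqs):
    """Coefficient of variation of the frequencies; 0 means a perfectly even load."""
    return pstdev(freqs) / mean(freqs)


# Toy top-2 assignments for 8 tokens and 4 experts (made-up data).
skewed = [(0, 1)] * 6 + [(0, 2)] * 2   # expert 0 is selected for every token
even = [(0, 1), (2, 3)] * 4            # every expert is selected equally often

skewed_freqs = activation_frequency(skewed, 4)
even_freqs = activation_frequency(even, 4)
print("skewed:", skewed_freqs, "imbalance:", imbalance(skewed_freqs))
print("even:  ", even_freqs, "imbalance:", imbalance(even_freqs))
```

In the skewed case expert 3 is never activated and expert 0 receives half of all activations, which is the kind of load imbalance the abstract attributes to the linear router.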