GRAPHMOE: Amplifying Cognitive Depth of Mixture-of-Experts Network via Introducing Self-Rethinking Mechanism

Bo Lv, Chen Tang, Zifan Zheng, Bohao Yang, Kun Zhao, Ning Liao, Xiaoxing Wang, Feiyu Xiong, Zhiyu Li, Nayu Liu, Jingchi Jiang

TL;DR

GraphMoE introduces a self-rethinking mechanism that interconnects MoE expert nodes with a recurrent routing process on a pseudo-graph, enabling iterative refinement of representations. Implemented with LoRA-based adapters, it achieves state-of-the-art results across multiple commonsense benchmarks, surpassing existing LoRA+MoE baselines. The approach emphasizes balanced expert collaboration and controlled complexity, uncovering a path toward more powerful reasoning in language models. Overall, the work demonstrates that graph-based, multi-round routing can enhance cognitive depth in MoE architectures with modest parameter overhead, inviting further exploration of iterative, graph-guided MoE designs.

Abstract

Traditional Mixture-of-Experts (MoE) networks benefit from utilizing multiple smaller expert models as opposed to a single large network. However, these experts typically operate independently, leaving open the question of whether interconnecting these models could enhance the performance of MoE networks. In response, we introduce GRAPHMOE, a novel method aimed at augmenting the cognitive depth of language models via a self-rethinking mechanism constructed on Pseudo GraphMoE networks. GRAPHMOE employs a recurrent routing strategy to simulate iterative thinking steps, thereby facilitating the flow of information among expert nodes. We implement the GRAPHMOE architecture using Low-Rank Adaptation (LoRA) techniques and conduct extensive experiments on various benchmark datasets. The experimental results reveal that GRAPHMOE outperforms other LoRA-based models, achieving state-of-the-art (SOTA) performance. Additionally, this study explores a novel recurrent routing strategy that may inspire further advancements in enhancing the reasoning capabilities of language models.
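
The recurrent routing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how state is carried between rounds, so the top-k softmax gating, the residual update, and every name below (`LoRAExpert`, `recurrent_route`, `rounds`, `top_k`) are assumptions made for illustration only. The sketch shows one input being routed several times, so that experts chosen in a later round see a representation already refined by experts chosen earlier.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix; stands in for trained weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class LoRAExpert:
    """Hypothetical expert applying a low-rank update x -> B(Ax).

    A is (rank x dim) and B is (dim x rank). Both are random here so the
    example produces non-trivial outputs without any training.
    """

    def __init__(self, dim, rank, rng):
        self.a = rand_matrix(rank, dim, rng)
        self.b = rand_matrix(dim, rank, rng)

    def __call__(self, x):
        return matvec(self.b, matvec(self.a, x))


def recurrent_route(x, experts, router, rounds=3, top_k=2):
    """Route `x` through the experts for several rounds (assumed scheme).

    Each round scores the current state, picks the top-k experts, mixes
    their outputs with renormalized gate weights, and adds the mixture
    back to the state, so the next round routes the refined state.
    """
    state = list(x)
    history = []
    for _ in range(rounds):
        probs = softmax(matvec(router, state))
        ranked = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)
        chosen = ranked[:top_k]
        norm = sum(probs[i] for i in chosen)
        update = [0.0] * len(state)
        for i in chosen:
            weight = probs[i] / norm
            output = experts[i](state)
            update = [u + weight * o for u, o in zip(update, output)]
        state = [s + u for s, u in zip(state, update)]  # residual update
        history.append(chosen)
    return state, history


rng = random.Random(0)
dim, rank, n_experts = 8, 2, 4
experts = [LoRAExpert(dim, rank, rng) for _ in range(n_experts)]
router = rand_matrix(n_experts, dim, rng)
x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
refined, history = recurrent_route(x, experts, router, rounds=3, top_k=2)
```

Here `history` records which experts were selected in each round, which is the kind of per-step selection data a workload analysis would consume. Setting `rounds=1` reduces the sketch to a single-pass top-k MoE layer.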

Paper Structure

This paper contains 25 sections, 9 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: Comparison between LoRA+MoE and GraphMoE architectures.
  • Figure 2: Overview of GraphMoE architecture. In this figure, the original Feed-Forward Network (FFN) layer in each transformer block is modified. FFN is also known as a Multi-Layer Perceptron (MLP).
  • Figure 3: The mechanism by which a virtual node collects and aggregates features from expert graph nodes.
  • Figure 4: Workloads of all experts, obtained by normalizing how often each expert is selected during every routing step. Red and blue dashed lines indicate the maximum and minimum workloads across all tasks, while the green dashed line shows the average workload. The standard deviations of the workload balance are $0.0313$ for the MixLoRA model and $0.0215$ for the GraphMoE model.
  • Figure 5: Sensitivity analysis of the additional hyperparameters in the GraphMoE architecture. The impact on GraphMoE's computational overhead is examined by measuring how inference time scales with the Reasoning Round ($T$). In subfigure (c), "Infer. Time" refers to the inference time, with the duration of the first round taken as one unit; the time of each subsequent round is shown as a multiple of the first round's time.
  • ...and 1 more figures
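
The workload statistic reported for Figure 4 can be sketched as follows. The caption does not state exactly how selections are counted or which standard deviation is used, so the population standard deviation, the function names, and the sample `selections` list below are assumptions for illustration; the numbers are hypothetical and do not reproduce the paper's results.

```python
from collections import Counter
from statistics import pstdev


def expert_workloads(selections, num_experts):
    """Return each expert's share of all routing selections.

    `selections` is a flat list of expert indices, one entry per time an
    expert was chosen at some routing step.
    """
    if not selections:
        raise ValueError("selections must not be empty")
    counts = Counter(selections)
    total = len(selections)
    return [counts.get(i, 0) / total for i in range(num_experts)]


def workload_balance(workloads):
    """Population standard deviation of the shares (lower is more balanced)."""
    return pstdev(workloads)


# Hypothetical selection log for 4 experts, not data from the paper.
selections = [0, 1, 1, 2, 3, 0, 2, 1]
workloads = expert_workloads(selections, num_experts=4)
spread = workload_balance(workloads)
```

A perfectly balanced router would give every expert the share `1 / num_experts` and a spread of zero, so a smaller value indicates a more even distribution of work across experts.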