GW-MoE: Resolving Uncertainty in MoE Router with Global Workspace Theory

Haoze Wu, Zihan Qiu, Zili Wang, Hang Zhao, Jie Fu

TL;DR

This work identifies pervasive routing uncertainty in large MoE models and introduces GW-MoE, a fine-tuning method guided by Global Workspace Theory that broadcasts uncertain tokens to all experts. By treating tokens with high routing entropy as uncertain and enabling cross-expert learning during training, GW-MoE preserves inference efficiency while improving performance across diverse NLP tasks and model scales. The approach yields consistent gains on GLUE, summarization, QA, and reasoning benchmarks, and ablations show that broadcasting only uncertain tokens, not all tokens, is crucial. The findings offer practical guidance for MoE router design and suggest that uncertainty-aware fine-tuning can enhance robust knowledge recall without adding inference cost.

Abstract

Mixture-of-Experts (MoE) has been demonstrated as an efficient method to scale up models. By dynamically and sparsely selecting activated experts, MoE can effectively reduce computational costs. Despite this success, we observe that many tokens in MoE models have uncertain routing results. These tokens have nearly equal scores for choosing each expert, and we demonstrate that this uncertainty can lead to incorrect selections. Inspired by the Global Workspace Theory (GWT), we propose a new fine-tuning method, GW-MoE, to address this issue. The core idea is to broadcast the uncertain tokens across experts during fine-tuning. As a result, these tokens can acquire the necessary knowledge from any expert during inference and become less sensitive to the choice. GW-MoE does not introduce additional inference overhead. We validate that GW-MoE mitigates the uncertainty problem and consistently improves performance across different tasks (text classification, question answering, summarization, code generation, and mathematical problem solving) and model sizes (650M and 8B parameters).
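
The routing rule described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `softmax`, `entropy`, and `route_token` are hypothetical, entropy is measured in nats, the threshold `h_star` defaults to 2.0 only as an example value, and the mixing weights (full softmax scores for broadcast tokens, renormalized scores for Top-K tokens) are assumed rather than taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def route_token(logits, k=2, h_star=2.0, training=True):
    """Return {expert_index: weight} for a single token.

    Assumption: a token counts as uncertain when its routing entropy
    exceeds h_star. Uncertain tokens are broadcast to every expert, but
    only during fine-tuning; at inference all tokens use Top-K routing.
    """
    probs = softmax(logits)
    if training and entropy(probs) > h_star:
        # Uncertain token: every expert processes it (assumed weights: raw scores).
        return {i: p for i, p in enumerate(probs)}
    # Certain token, or inference: keep the k highest-scoring experts.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


if __name__ == "__main__":
    near_uniform = [0.0] * 8            # entropy = ln(8), above the threshold
    peaked = [10.0] + [0.0] * 7         # one expert clearly preferred
    print(len(route_token(near_uniform, training=True)))   # broadcast
    print(len(route_token(near_uniform, training=False)))  # Top-K at inference
    print(len(route_token(peaked, training=True)))         # Top-K, token is certain
```

With 8 experts the largest possible entropy is ln 8 ≈ 2.08 nats, so a threshold near 2.0 selects only tokens whose router scores are close to uniform. Because the `training=False` path never broadcasts, the number of experts evaluated per token at inference is the same as in standard Top-K routing.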

Paper Structure (26 sections, 6 equations, 5 figures, 10 tables)

Figures (5)

  • Figure 1: Randomly selecting experts for uncertain tokens can give better results. We let the uncertain tokens (entropy greater than $2.0$) in the last layer of JetMoE select experts at random; the average results over multiple runs on three tasks (blue) are better than those obtained with the Top-$K$ operator (dashed line). As a control, we let the same proportion of arbitrary tokens select experts at random and observe that the results (gray) are worse than when only uncertain tokens are randomized. The metrics for each task are the same as those in Sec. \ref{sec:IT}.
  • Figure 2: Overview of GW-MoE. Left: According to GWT, some neural signals (grey) only need to activate a single functional module in the human brain, while others (blue) use the global workspace to broadcast information, facilitating cooperation between modules. Right: GW-MoE is inspired by GWT. Tokens whose router output scores are nearly uniform (blue) are called uncertain tokens and are broadcast to all experts during fine-tuning; during inference, since all experts have learned the knowledge of uncertain tokens, these tokens can obtain the necessary information from any expert. The rest (grey) are certain tokens, routed to the Top-$K$ experts during both inference and fine-tuning, following standard MoE.
  • Figure 3: The $50$ most frequently broadcast tokens in JetMoE. Most of them do not have a clear semantic meaning.
  • Figure 4: The $50$ most frequently broadcast tokens in the encoder of Switch-Base-8. These tokens are mostly common words with clear semantics.
  • Figure 5: The variation of EM with $H^*$. The dashed line indicates the result of standard fine-tuning.
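
The probing experiment in the Figure 1 caption (uncertain tokens pick experts at random instead of through Top-$K$) can be sketched as below. This is an illustrative sketch only: `router_entropy` and `select_experts` are hypothetical names, entropy is assumed to be in nats, and sampling `k` distinct experts uniformly is an assumption about how the random selection is drawn.

```python
import math
import random


def router_entropy(logits):
    """Entropy (nats) of the softmax distribution over router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_experts(logits, k=2, threshold=2.0, rng=None):
    """Pick k expert indices for one token.

    If rng is given and the token's routing entropy exceeds the threshold,
    the token is treated as uncertain and k distinct experts are drawn
    uniformly at random (assumed sampling scheme). Otherwise the usual
    Top-K selection on the router logits is used.
    """
    n = len(logits)
    if rng is not None and router_entropy(logits) > threshold:
        return sorted(rng.sample(range(n), k))
    ranked = sorted(range(n), key=lambda i: logits[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    rng = random.Random(0)                   # fixed seed for repeatability
    uncertain = [0.0] * 8                    # near-uniform router scores
    certain = [5.0, 4.0] + [0.0] * 6         # two experts clearly preferred
    print(select_experts(uncertain, rng=rng))   # random pair of experts
    print(select_experts(certain, rng=rng))     # Top-K pair
    print(select_experts(uncertain, rng=None))  # Top-K baseline
```

Passing `rng=None` reproduces the Top-$K$ baseline, so the same function can be used for both arms of the comparison; the control condition in the caption (randomizing the same proportion of arbitrary tokens) would need a separate selection mask and is not covered here.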