THOR-MoE: Hierarchical Task-Guided and Context-Responsive Routing for Neural Machine Translation

Yunlong Liang, Fandong Meng, Jie Zhou

TL;DR

THOR-MoE targets two limitations of MoE in neural machine translation: reliance on task knowledge that is generally unavailable in practical applications, and token-level routing that ignores context. It adds a hierarchical task-guided routing component, which predicts the domain or language and builds a mixed task representation to guide a task-level router, and a context-responsive token routing mechanism that injects global context into per-token routing decisions. The framework is a plug-and-play module compatible with Top-$k$ and Top-$p$ routing and shows consistent gains on multi-domain and multilingual translation benchmarks; for example, it improves on vanilla Top-$p$ routing by an average of 0.75 BLEU with less than 22% activated parameters on multi-domain translation. These results suggest that THOR-MoE can improve both translation quality and efficiency across diverse translation settings and MoE architectures.

Abstract

The sparse Mixture-of-Experts (MoE) has achieved significant progress for neural machine translation (NMT). However, current MoE solutions have two limitations that may lead to sub-optimal performance: 1) they directly inject NMT task knowledge (e.g., domain- or language-specific knowledge) into the MoE, although such knowledge is generally unavailable in practical applications, and they neglect the naturally grouped domain/linguistic properties; 2) expert selection depends only on the local token representation and ignores the context, which captures the state of each token from a global view. To address these limitations, we propose THOR-MoE, which arms the MoE with hierarchical task-guided and context-responsive routing policies. Specifically, it 1) first predicts the domain/language label and then extracts a mixed domain/language representation to allocate task-level experts in a hierarchical manner; 2) injects context information to enhance token routing within the pre-selected set of task-level experts, which helps each token to be routed accurately to more specialized and suitable experts. Extensive experiments on multi-domain and multilingual translation benchmarks with different architectures consistently demonstrate the superior performance of THOR-MoE. Additionally, THOR-MoE operates as a plug-and-play module compatible with existing Top-$k$ (Shazeer et al., 2017) and Top-$p$ (Huang et al., 2024) routing schemes, ensuring broad applicability across diverse MoE architectures. For instance, compared with vanilla Top-$p$ routing (Huang et al., 2024), the context-aware approach achieves an average improvement of 0.75 BLEU with less than 22% activated parameters on multi-domain translation tasks.
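
The two routing stages described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the mean-pooled context vector, the additive fusion of token and context, the probability-weighted mixture of task embeddings, and every name (`thor_route`, `w_task_cls`, `w_task_router`, `w_token_router`, `n_task_experts`) are assumptions chosen only to make the hierarchy concrete. The second stage uses Top-$k$ selection for simplicity.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(weights, vec):
    # One output score per row of `weights`.
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def top_k(scores, k):
    # Indices of the k largest scores, best first.
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def thor_route(tokens, task_embs, w_task_cls, w_task_router, w_token_router,
               n_task_experts, k):
    dim = len(tokens[0])
    # Assumption: the global context is the mean of the token representations.
    context = [sum(tok[j] for tok in tokens) / len(tokens) for j in range(dim)]

    # Stage 1a: predict a distribution over domains/languages from the context.
    task_probs = softmax(matvec(w_task_cls, context))
    # Stage 1b: mixed task representation = probability-weighted task embeddings.
    mixed = [sum(p * emb[j] for p, emb in zip(task_probs, task_embs))
             for j in range(dim)]
    # Stage 1c: the task-level router pre-selects a candidate expert set.
    candidates = sorted(top_k(matvec(w_task_router, mixed), n_task_experts))

    # Stage 2: context-responsive token routing restricted to the candidates.
    routes = []
    for tok in tokens:
        # Assumption: context is injected by simple addition to the token.
        fused = [x + c for x, c in zip(tok, context)]
        logits = matvec(w_token_router, fused)
        probs = softmax([logits[e] for e in candidates])
        routes.append([(candidates[i], probs[i]) for i in top_k(probs, k)])
    return candidates, routes


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# Toy usage with random weights (hypothetical sizes).
rng = random.Random(0)
dim, n_tasks, n_experts = 8, 3, 6
tokens = rand_matrix(5, dim, rng)
candidates, routes = thor_route(
    tokens,
    task_embs=rand_matrix(n_tasks, dim, rng),
    w_task_cls=rand_matrix(n_tasks, dim, rng),
    w_task_router=rand_matrix(n_experts, dim, rng),
    w_token_router=rand_matrix(n_experts, dim, rng),
    n_task_experts=4,
    k=2,
)
print("task-level expert set:", candidates)
for i, route in enumerate(routes):
    print(f"token {i} ->", [(e, round(p, 3)) for e, p in route])
```

The point of the sketch is the restriction in stage 2: each token can only be routed to experts that the task-level router already selected for the whole query, so the token router chooses among fewer, more task-relevant experts.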

Paper Structure

This paper contains 32 sections, 16 equations, 2 figures, 10 tables.

Figures (2)

  • Figure 1: Overview of the proposed THOR-MoE: 1) vanilla token routing; 2) hierarchical task-guided routing; 3) context-responsive token routing. 'Emb.' and 'Rep.' denote embedding and representation, respectively. The routing is hierarchical: task-guided routing first selects a different task-specific expert set for each query (e.g., $\mathcal{E}_t$ for query 1), and context-responsive token routing then assigns experts from $\mathcal{S}^t$ to each token in query 1.
  • Figure 2: Average number of activated experts across training steps on the multi-domain translation task.
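
Under a Top-$p$-style router the number of activated experts varies from token to token, which is presumably the kind of quantity a statistic like the one in Figure 2 tracks. The following sketch shows how such a count could be computed for one batch. It assumes the usual Top-$p$ rule (activate the highest-probability experts until their cumulative probability reaches a threshold `p`); the function names and the toy logits are hypothetical and do not come from the paper.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_experts(logits, p):
    """Smallest prefix of experts, by descending probability, whose mass reaches p."""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return chosen


def average_activated(batch_logits, p):
    """Mean number of experts activated per token in a batch of router logits."""
    counts = [len(top_p_experts(logits, p)) for logits in batch_logits]
    return sum(counts) / len(counts)


# Toy batch: one confident token and one token with a uniform router output.
batch = [
    [10.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
print(average_activated(batch, p=0.6))
```

With `p=0.6`, the confident token activates a single expert while the uniform token needs three, so sharper router distributions translate directly into fewer activated experts.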