Direct Neural Machine Translation with Task-level Mixture of Experts models

Isidora Chara Tourni, Subhajit Naskar

TL;DR

The paper addresses direct NMT between non-English languages, where direct parallel data is scarce. It applies Task-level Mixture-of-Experts (MoE) to a multilingual Transformer, routing each translation task by language pair (LP) or by target language (TL) so that learning is distributed across experts. It compares 16- and 64-expert configurations against bilingual and pivot baselines across 108 languages and 53 direct pairs. The contributions are BLEU comparisons and expert-routing analyses: the 16-expert LP and TL models often match or surpass the baselines on many direct directions, while the 64-expert variants yield mixed results. The paper also gives guidance on which routing strategy suits which language pairs. Overall, the approach offers a scalable, inference-efficient path to expanding direct NMT coverage, informs mixture-of-experts design choices for multilingual translation, and has implications for deploying smaller task-specific dense models derived from the MoE.

Abstract

Direct neural machine translation (direct NMT) is a type of NMT system that translates text between two non-English languages. Direct NMT systems often face limitations due to the scarcity of parallel data between non-English language pairs. Several approaches have been proposed to address this limitation, such as multilingual NMT and pivot NMT (translation between two languages via English). Task-level Mixture of Experts models (Task-level MoE), an inference-efficient variation of Transformer-based models, have shown promising NMT performance for a large number of language pairs. In Task-level MoE, different language groups can use different routing strategies to optimize cross-lingual learning and inference speed. In this work, we examine Task-level MoE's applicability in direct NMT and propose a series of high-performing training and evaluation configurations, through which Task-level MoE-based direct NMT systems outperform bilingual and pivot-based models for a large number of low- and high-resource direct pairs and translation directions. Our Task-level MoE with 16 experts outperforms bilingual and pivot NMT models for 7 language pairs, while pivot-based models still perform better in 9 pairs and directions.
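
The routing idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `task_id` and `TaskLevelRouter` are hypothetical, the router logits are random placeholders standing in for parameters that a real model would learn, and the top-2 choice follows the router described in the Figure 1 caption. The sketch only shows how a whole translation task, rather than each token, is mapped to a fixed set of experts, and how the LP and TL strategies differ in the routing key they use.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def task_id(src, tgt, strategy):
    """Hypothetical helper: build the routing key for a translation direction.

    "lp" keys on the language pair, "tl" keys on the target language only.
    """
    if strategy == "lp":
        return f"{src}-{tgt}"
    if strategy == "tl":
        return tgt
    raise ValueError(f"unknown routing strategy: {strategy}")


class TaskLevelRouter:
    """Toy task-level router: every token of a task uses the same experts."""

    def __init__(self, num_experts, top_k=2, seed=0):
        self.num_experts = num_experts
        self.top_k = top_k
        self._rng = random.Random(seed)
        # task id -> router logits (random here; learned in a real model)
        self._logits = {}

    def route(self, task):
        """Return [(expert_index, weight), ...] for the top-k experts."""
        if task not in self._logits:
            self._logits[task] = [
                self._rng.gauss(0.0, 1.0) for _ in range(self.num_experts)
            ]
        probs = softmax(self._logits[task])
        top = sorted(
            range(self.num_experts), key=lambda i: probs[i], reverse=True
        )[: self.top_k]
        norm = sum(probs[i] for i in top)
        # Renormalize so the selected experts' weights sum to 1.
        return [(i, probs[i] / norm) for i in top]


router = TaskLevelRouter(num_experts=16)
# Under TL routing, ja-ko and de-ko share a key ("ko") and thus the same experts.
print(router.route(task_id("ja", "ko", "tl")))
print(router.route(task_id("de", "ko", "tl")))
# Under LP routing, each pair gets its own key and its own expert choice.
print(router.route(task_id("ja", "ko", "lp")))
```

Because the expert choice depends only on the task key, the experts needed for one task are known ahead of time, which is what makes it possible in principle to extract a smaller dense model for a single direction such as ja-ko.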

Paper Structure (5 sections, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Task-level MoE model with LP-based routing; each language pair is routed through a top-2 router to an expert in the model's expert layer. From a pretrained Task-level MoE model we can extract a smaller dense network specializing in a certain task, e.g. ja-ko NMT.
  • Figure 2: Difference in BLEU scores between Task-level MoE models trained with 16 and 64 experts on direct pairs, for models with Language Pair (LP)-based routing during training and the lp_a, lp_b, lp_c mappings during inference, and for models with Target Language (TL)-based routing during training and the tl_a, tl_b mappings during inference. For each pair, we mark the BLEU score of the model that shows the largest difference between the 16- and 64-expert Task-level MoE variants.
  • Figure 3: Routing decisions of the last encoder layer of our Task-level MoE model with 16 experts, trained for 1M steps with each pair's target language mapped to a task id, with tl_a used during inference.
  • Figure 4: Routing decisions of the last decoder layer of our Task-level MoE model with 16 experts, trained for 1M steps with each pair's target language mapped to a task id, with tl_a used during inference.
  • Figure 5: Routing decisions of the last encoder layer of our Task-level MoE model with 16 experts, trained for 2M steps with each pair's target language mapped to a task id, with tl_a used during inference.
  • ...and 3 more figures