Multilingual Non-Autoregressive Machine Translation without Knowledge Distillation

Chenyang Huang, Fei Huang, Zaixiang Zheng, Osmar R. Zaïane, Hao Zhou, Lili Mou

TL;DR

This work tackles multilingual machine translation with non-autoregressive models by eliminating the need for knowledge distillation through a Directed Acyclic Transformer (DAT). It introduces Pivot Back-Translation (PivotBT) to improve generalization to unseen directions, enabling effective zero-shot performance. The proposed M-DAT achieves state-of-the-art results among NAT MNMT systems and even surpasses strong autoregressive baselines in zero-shot settings, while maintaining fast inference speeds. The approach reduces deployment complexity and demonstrates practical benefits for multilingual translation in latency-constrained scenarios.

Abstract

Multilingual neural machine translation (MNMT) aims at using one single model for multiple translation directions. Recent work applies non-autoregressive Transformers to improve the efficiency of MNMT, but requires expensive knowledge distillation (KD) processes. To this end, we propose an M-DAT approach to non-autoregressive multilingual machine translation. Our system leverages the recent advance of the directed acyclic Transformer (DAT), which does not require KD. We further propose a pivot back-translation (PivotBT) approach to improve the generalization to unseen translation directions. Experiments show that our M-DAT achieves state-of-the-art performance in non-autoregressive MNMT.

Paper Structure

This paper contains 14 sections, 4 equations, 2 figures, 8 tables.

Figures (2)

  • Figure 1: An example of our PivotBT augmenting a German sentence $\mathbf{y}$ to Romanian $\hat{\mathbf{x}}$, where English is used as the pivot language. The training and back-translation steps are accomplished by DAT. $N_{\text{dec}}$ is the number of decoding layers.
  • Figure 2: Comparison between M-DAT and Switch-GLAT in the preservation ratio of low-frequency words.
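
The PivotBT augmentation described in the Figure 1 caption can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the `Translator` callable, the function name `pivot_back_translate`, and the toy lookup table are all hypothetical, and treating the result $(\hat{\mathbf{x}}, \mathbf{y})$ as a synthetic training pair for the Romanian-to-German direction is assumed from the usual meaning of back-translation. In the paper, both translation hops are performed by the model itself.

```python
from typing import Callable, Tuple

# Hypothetical translator interface: (sentence, src_lang, tgt_lang) -> sentence.
# In the paper this role is played by the multilingual model; here it is a stub.
Translator = Callable[[str, str, str], str]


def pivot_back_translate(
    y: str,
    y_lang: str,
    x_lang: str,
    translate: Translator,
    pivot: str = "en",
) -> Tuple[str, str]:
    """Build a synthetic (x_hat, y) pair by translating y through a pivot language.

    Assumption: the returned pair is later used as training data for the
    x_lang -> y_lang direction, which has no parallel data of its own.
    """
    # Hop 1: source-side sentence y into the pivot language (e.g. German -> English).
    pivot_sentence = translate(y, y_lang, pivot)
    # Hop 2: pivot sentence into the augmentation language (e.g. English -> Romanian).
    x_hat = translate(pivot_sentence, pivot, x_lang)
    return x_hat, y


# Toy stand-in for a real translation model, keyed by (src, tgt, sentence).
_TOY_TABLE = {
    ("de", "en", "Guten Morgen"): "Good morning",
    ("en", "ro", "Good morning"): "Bună dimineața",
}


def toy_translate(sentence: str, src: str, tgt: str) -> str:
    """Look up a canned translation; raises KeyError for unknown inputs."""
    return _TOY_TABLE[(src, tgt, sentence)]


if __name__ == "__main__":
    x_hat, y = pivot_back_translate("Guten Morgen", "de", "ro", toy_translate)
    print(f"synthetic pair: ({x_hat!r}, {y!r})")
```

Passing the translator in as a callable keeps the sketch independent of any particular model; a real system would replace `toy_translate` with calls to the trained translation model.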