Imitation Learning for Non-Autoregressive Neural Machine Translation

Bingzhen Wei, Mingxuan Wang, Hao Zhou, Junyang Lin, Jun Xie, Xu Sun

TL;DR

The paper tackles the accuracy gap between non-autoregressive translation (NAT) and autoregressive translation (AT) by introducing imitation learning in a NAT framework. An autoregressive demonstrator guides each decoding state of the NAT learner, enabling rich supervision without sacrificing NAT's parallel decoding speed. Key contributions include a differentiable Soft Copy mechanism, action distribution regularization, and a training objective that blends imitation signals with standard NAT/AT losses; experiments on IWSLT16, WMT14, and WMT16 show BLEU scores approaching autoregressive results while delivering substantial speedups. The approach demonstrates that imitation learning can effectively bridge the NAT-AT gap and offers practical implications for fast, high-quality translation in real-time systems.

Abstract

Non-autoregressive translation models (NAT) have achieved impressive inference speedup. A potential issue of the existing NAT algorithms, however, is that the decoding is conducted in parallel, without directly considering previous context. In this paper, we propose an imitation learning framework for non-autoregressive machine translation, which still enjoys the fast translation speed but gives comparable translation performance compared to its auto-regressive counterpart. We conduct experiments on the IWSLT16, WMT14 and WMT16 datasets. Our proposed model achieves a significant speedup over the autoregressive models, while keeping the translation quality comparable to the autoregressive models. By sampling sentence length in parallel at inference time, we achieve the performance of 31.85 BLEU on WMT16 Ro$\rightarrow$En and 30.68 BLEU on IWSLT16 En$\rightarrow$De.

Paper Structure

This paper contains 30 sections, 14 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Neural architectures for Autoregressive NMT and Non-Autoregressive NMT.
  • Figure 2: Illustration of the proposed model, where the black solid arrows represent differentiable connections and the dashed arrows are non-differentiable operations. Without loss of generality, this figure shows the case of T=3, T'=4. The left side of the figure is the DAT model and the right side is the imitate-NAT. The bottom is the encoder and the top is the decoder. The internal details of the Imitation Module are shown in Figure 3.
  • Figure 3: The imitation module of AT demonstrator and NAT learner.
  • Figure 4: Action category assignment distribution. The redistribution method leads to a more balanced distribution (blue); without it, the distribution is extremely unbalanced (red).