What Have We Achieved on Non-autoregressive Translation?

Yafu Li, Huajian Zhang, Jianhao Yan, Yongjing Yin, Yue Zhang

TL;DR

This paper asks how close non-autoregressive translation (NAT) really comes to autoregressive translation (AT) once evaluation goes beyond BLEU. It compares four representative NAT methods (MgMO, CTC, DAT, and CMLM) against AT on WMT benchmarks using automatic metrics, GPT-4-based judgments, and MQM human ratings. AT generally outperforms NAT on model-based and human-aligned metrics, even though some NAT variants approach AT on rule-based metrics. A key finding is that explicitly modeling target-side dependencies markedly improves fluency and generalization, whereas weak dependency modeling leads to repetitions, omissions, and spelling errors. The work highlights that advancing NAT requires stronger explicit dependency modeling without sacrificing decoding speed.

Abstract

Recent advances have made non-autoregressive translation (NAT) comparable to autoregressive translation (AT). However, these methods are typically evaluated with BLEU, which has been shown to correlate only weakly with human annotations. Few studies compare NAT and AT comprehensively, leaving it unclear how close NAT truly is to AT. To address this gap, we systematically evaluate four representative NAT methods across various dimensions, including human evaluation. Our empirical results demonstrate that, despite a narrowing performance gap, state-of-the-art NAT still underperforms AT under more reliable evaluation metrics. Furthermore, we find that explicitly modeling dependencies is crucial for generating natural language and for generalizing to out-of-distribution sequences.

Paper Structure (41 sections, 15 equations, 5 figures, 16 tables)

Figures (5)

  • Figure 1: Heatmap visualization of MQM evaluation: darker colours indicate larger error counts for certain error types. The left side presents major-level errors while the right side shows minor-level errors.
  • Figure 2: N-gram repetition of different models (WMT21 De$\Rightarrow$En), where the x-axis represents the size of the n-gram and the y-axis represents the count.
  • Figure 3: Translation quality (COMET) w.r.t. source sequence length on WMT21 De$\Rightarrow$En.
  • Figure 4: Average cross-domain performance (COMET) of WMT21 De$\Rightarrow$En models on out-of-domain test sets.
  • Figure 5: Relative decrease (%) in translation performance (COMET) on noisy test sets of WMT21 De$\Rightarrow$En, with darker colours indicating greater degradation.
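
The n-gram repetition count plotted in Figure 2 can be illustrated with a minimal sketch. The exact counting procedure and tokenization behind the figure are not given here, so the following assumes whitespace tokenization and counts every occurrence of an n-gram beyond its first within a single output sentence; the function names `repeated_ngram_count` and `corpus_repetition` are hypothetical.

```python
from collections import Counter


def repeated_ngram_count(tokens, n):
    """Count n-gram occurrences beyond the first within one sentence.

    Assumption: a "repetition" is any extra occurrence of an n-gram
    that already appeared in the same sentence.
    """
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    return sum(c - 1 for c in counts.values() if c > 1)


def corpus_repetition(sentences, max_n=4):
    """Sum repeated n-gram counts over a list of sentences for n = 1..max_n.

    Assumption: sentences are plain strings split on whitespace.
    """
    return {
        n: sum(repeated_ngram_count(s.split(), n) for s in sentences)
        for n in range(1, max_n + 1)
    }


if __name__ == "__main__":
    hypotheses = ["the cat the cat sat", "a b c"]
    print(corpus_repetition(hypotheses))
```

Running this per system on the same test set gives one count per n-gram size, which corresponds to the x-axis (n-gram size) and y-axis (count) described in the caption. Unigram counts will be inflated by legitimate repeats of function words, so the larger n-gram sizes are likely the more informative signal of degenerate repetition.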