
Semi-Autoregressive Neural Machine Translation

Chunqi Wang, Ji Zhang, Haiqing Chen

TL;DR

The paper tackles slow decoding in autoregressive neural machine translation by proposing the Semi-Autoregressive Transformer (SAT), which generates target tokens in groups of size $K$ to enable parallel decoding while preserving autoregressive dependencies between groups. SAT combines a group-level chain rule, long-distance prediction, and a relaxed causal mask to interpolate between the Transformer ($K=1$) and non-autoregressive models ($K\ge n$), trading a limited BLEU loss for substantial speedups. Empirical results on WMT'14 English–German and Chinese–English show speedups of up to 5.58× under greedy decoding (on English–German, while retaining 88% of translation quality) and near-lossless generation for small $K$ (e.g., about 1% BLEU degradation at $K=2$). The work highlights the benefits of knowledge distillation, initialization from a pretrained Transformer, and the importance of modeling long-distance dependencies, and it suggests further work on training objectives and dynamic grouping.
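
As a rough illustration of the group-level chain rule mentioned above, the factorization can be sketched as follows. The notation ($G_t$ for the $t$-th group, $\lceil n/K \rceil$ for the number of groups) is an assumption made here for readability and may differ from the paper's own symbols.

$$
p(y \mid x) \;=\; \prod_{t=1}^{\lceil n/K \rceil} p\left(G_t \mid G_1, \ldots, G_{t-1}, x\right),
\qquad
G_t = \left(y_{(t-1)K+1}, \ldots, y_{\min(tK,\, n)}\right)
$$

With $K=1$ each group holds a single token and the product reduces to the usual token-level chain rule; with $K \ge n$ there is a single group and every token is produced in one step.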

Abstract

Existing approaches to neural machine translation are typically autoregressive models. While these models attain state-of-the-art translation quality, they suffer from low parallelizability and are thus slow at decoding long sequences. In this paper, we propose a novel model for fast sequence generation: the semi-autoregressive Transformer (SAT). The SAT keeps the autoregressive property globally but relaxes it locally, and is thus able to produce multiple successive words in parallel at each time step. Experiments conducted on English-German and Chinese-English translation tasks show that the SAT achieves a good balance between translation quality and decoding speed. On WMT'14 English-German translation, the SAT achieves a 5.58$\times$ speedup while maintaining 88% of translation quality, significantly better than previous non-autoregressive methods. When producing two words at each time step, the SAT is almost lossless (only 1% degradation in BLEU score).
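
To make "multiple successive words in parallel at each time step" concrete, here is a minimal sketch of a greedy decoding loop that emits $K$ tokens per step. It is an illustration under stated assumptions, not the authors' implementation: `predict_group` is a hypothetical stand-in for the model, `EOS` is a placeholder end-of-sequence token, and discarding everything after the first `EOS` in a group is an assumed stopping rule.

```python
from typing import Callable, List, Sequence

EOS = "</s>"  # hypothetical end-of-sequence token


def semi_autoregressive_decode(
    predict_group: Callable[[Sequence[str], int], List[str]],
    k: int,
    max_len: int,
) -> List[str]:
    """Greedy decoding that emits up to k tokens per step.

    predict_group(prefix, k) is a hypothetical stand-in for the model: given
    the tokens generated so far, it returns the next k tokens at once.
    """
    output: List[str] = []
    while len(output) < max_len:
        group = predict_group(output, k)[:k]
        if not group:
            break  # nothing predicted; avoid looping forever
        for token in group:
            if token == EOS:
                return output  # assumed rule: drop EOS and anything after it
            output.append(token)
            if len(output) == max_len:
                break
    return output


def count_steps(n: int, k: int) -> int:
    """Number of decoding steps needed to emit n tokens: ceil(n / k)."""
    return -(-n // k)
```

Under this sketch, a 24-token output takes `count_steps(24, 1)` = 24 steps with a standard autoregressive decoder and `count_steps(24, 4)` = 6 steps with $K=4$. Measured wall-clock speedups also depend on the per-step cost, so they need not match this ratio.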

Paper Structure

This paper contains 16 sections, 9 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: The different levels of autoregressive properties. Lines with arrows indicate dependencies. We mark the longest dependency path with bold red lines. The length of the longest dependency path decreases as we relax the autoregressive property. An extreme case is non-autoregressive, where there is no dependency at all.
  • Figure 2: The architecture of the Transformer and of the SAT; the red dashed boxes mark the parts where the two models differ.
  • Figure 3: Short-distance prediction (top) and long-distance prediction (bottom).
  • Figure 4: Strict causal mask (left) and relaxed causal mask (right) when the target length $n=6$ and the group size $K=2$. We mark their differences in bold.
  • Figure 5: Performance of the SAT with and without sequence-level knowledge distillation.
  • ...and 1 more figure
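
The two masks contrasted in the Figure 4 caption can be sketched in a few lines of Python. This is an illustrative reconstruction, assuming that the relaxed mask lets each position attend to every position in its own group and in all earlier groups; the function names and the 0-indexed convention are choices made here, not taken from the paper.

```python
from typing import List


def strict_causal_mask(n: int) -> List[List[int]]:
    """Position i may attend to positions j <= i (0-indexed)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def relaxed_causal_mask(n: int, k: int) -> List[List[int]]:
    """Position i may attend to every position up to the end of its own group.

    Assumption: groups are consecutive blocks of k positions.
    """
    mask = []
    for i in range(n):
        group_end = (i // k + 1) * k  # exclusive end index of i's group
        mask.append([1 if j < group_end else 0 for j in range(n)])
    return mask


if __name__ == "__main__":
    # The setting from the Figure 4 caption: n = 6, K = 2.
    for row in relaxed_causal_mask(6, 2):
        print(row)
```

With `k=1` the relaxed mask coincides with the strict one, and with `k >= n` every position can see every other position, mirroring the interpolation between the Transformer and non-autoregressive models.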