LAVA NAT: A Non-Autoregressive Translation Model with Look-Around Decoding and Vocabulary Attention

Xiaoya Li, Yuxian Meng, Arianna Yuan, Fei Wu, Jiwei Li

TL;DR

The paper tackles the multimodality problem in non-autoregressive translation by introducing Look-Around (LA) decoding and Vocabulary Attention (VA). LA leverages neighbor tokens to guide the current-token prediction, while VA models long-range token dependencies by attending to the full vocabulary, and both are integrated within a Transformer-based NAT framework complemented by dynamic bidirectional decoding. Through extensive experiments on four benchmarks, LAVA NAT achieves competitive BLEU scores with substantial inference speedups compared to autoregressive models and previous NAT methods, and ablations confirm the effectiveness of VA, LA, and dynamic decoding. This work offers a practical, efficient NAT approach with strong translation quality and demonstrates the value of explicit position-token and token-token modeling in parallel decoding settings.

Abstract

Non-autoregressive translation (NAT) models generate multiple tokens in one forward pass and are highly efficient at the inference stage compared with autoregressive translation (AT) methods. However, NAT models often suffer from the multimodality problem, i.e., generating duplicated tokens or missing tokens. In this paper, we propose two novel methods to address this issue: the Look-Around (LA) strategy and the Vocabulary Attention (VA) mechanism. The Look-Around strategy predicts the neighbor tokens in order to predict the current token, and the Vocabulary Attention mechanism models long-term token dependencies inside the decoder by attending to the whole vocabulary at each position to learn which token is about to be generated. We also propose a dynamic bidirectional decoding approach to accelerate the inference process of the LAVA model while preserving the high quality of the generated output. Our proposed model uses significantly less time during inference than autoregressive models and most other NAT models. Our experiments on four benchmarks (WMT14 En$\rightarrow$De, WMT14 De$\rightarrow$En, WMT16 Ro$\rightarrow$En and IWSLT14 De$\rightarrow$En) show that the proposed model achieves competitive performance compared with state-of-the-art non-autoregressive and autoregressive models while significantly reducing the time cost in the inference phase.
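
To make the idea of "attending the whole vocabulary for each position" concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not give the exact formulation, so the sketch assumes dot-product scores between each decoder position and every token embedding, with the embedding matrix serving as both keys and values. The names `vocabulary_attention`, `hidden`, and `vocab_emb` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def vocabulary_attention(hidden, vocab_emb):
    """For each position, attend over every vocabulary embedding.

    hidden:    n position vectors, each of length d
    vocab_emb: V token embeddings, each of length d. Using them as both
               keys and values is an assumption of this sketch.
    Returns (weights, context): n x V attention weights and n x d
    vocabulary-aware context vectors.
    """
    weights, context = [], []
    for h in hidden:
        # Dot-product score between this position and every token.
        scores = [sum(hi * wi for hi, wi in zip(h, w)) for w in vocab_emb]
        a = softmax(scores)
        weights.append(a)
        # Context vector: attention-weighted sum of token embeddings.
        context.append([
            sum(a[v] * vocab_emb[v][k] for v in range(len(vocab_emb)))
            for k in range(len(h))
        ])
    return weights, context


# Toy example: 3-token vocabulary, 2 decoder positions, d = 2.
vocab_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = [[2.0, 0.0], [0.0, 3.0]]
weights, context = vocabulary_attention(hidden, vocab_emb)
```

Each row of `weights` is a distribution over the vocabulary for one position, which is one way a decoder layer could get a soft preview of the token it is about to emit.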

Paper Structure

This paper contains 40 sections, 9 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: An overview of the proposed NAT model architecture.
  • Figure 2: The overview of Look-Around Decoding. For each position, the decoder is required to first predict tokens on its left side and on its right side before generating the token at the current position. Then, the token for the current position is decoded by incorporating the tokens on both sides, with two gates controlling to what degree these tokens contribute to the current token.
  • Figure 3: Illustration of different decoding strategies. Note that the sequential decoding does not use the Look-Around mechanism.
  • Figure 4: Performances with respect to different sentence lengths.
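
The gated fusion described in the Figure 2 caption can be sketched as follows. This is an illustrative guess rather than the paper's equations: the caption only says that two gates control how much the left and right tokens contribute, so the sigmoid gates over concatenated vectors, the additive fusion, and the names `look_around_fuse`, `w_left`, and `w_right` are all assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def look_around_fuse(h, left, right, w_left, w_right):
    """Fuse the current position's state with its predicted neighbors.

    h, left, right:  vectors of length d (current state, left-neighbor
                     and right-neighbor representations)
    w_left, w_right: hypothetical gate weight vectors of length 2 * d
    Returns (fused, g_left, g_right).
    """
    # `h + left` concatenates the two Python lists into [h; left].
    g_left = sigmoid(dot(w_left, h + left))
    g_right = sigmoid(dot(w_right, h + right))
    # Each gate scales how much its neighbor contributes to the result.
    fused = [hi + g_left * li + g_right * ri
             for hi, li, ri in zip(h, left, right)]
    return fused, g_left, g_right


# Toy example with d = 2: a neutral left gate and a nearly open right gate.
h = [1.0, 0.0]
left = [0.0, 1.0]
right = [1.0, 1.0]
w_left = [0.0, 0.0, 0.0, 0.0]
w_right = [10.0, 10.0, 10.0, 10.0]
fused, g_left, g_right = look_around_fuse(h, left, right, w_left, w_right)
```

In a trained model the gate weights would be learned, and `left` and `right` would come from the decoder's own predictions for the neighboring positions instead of being fixed inputs.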