Glancing Future for Simultaneous Machine Translation

Shoutao Guo, Shaolei Zhang, Yang Feng

TL;DR

This work addresses the gap between seq2seq and prefix2prefix training in simultaneous machine translation by introducing glancing future training, a curriculum that first lets target tokens attend to future source information and then gradually withdraws it. The method widens the source context available to target token $i$ from $g_i$ to $\hat{g}_i = g_i + f_i$, where $f_i$ is the extra future context governed by a decaying exposure parameter $\alpha$, enabling a smooth transition from full-sentence to low-latency training. Integrated with both fixed (Wait-$k$) and adaptive (HMT) SiMT policies, the approach improves BLEU on the En→Vi and De→En tasks and lowers the hallucination rate on De→En. The findings offer practical guidance on selecting adjacent future information and demonstrate that bridging the training gap can enhance global-information usage without destabilizing latency. The work thus provides a versatile, training-time mechanism to strengthen SiMT models in streaming scenarios.

Abstract

Simultaneous machine translation (SiMT) outputs translation while reading the source sentence. Unlike conventional sequence-to-sequence (seq2seq) training, existing SiMT methods adopt prefix-to-prefix (prefix2prefix) training, where the model predicts target tokens based on partial source tokens. However, prefix2prefix training diminishes the ability of the model to capture global information and introduces forced predictions due to the absence of essential source information. Consequently, it is crucial to bridge the gap between prefix2prefix training and seq2seq training to enhance the translation capability of the SiMT model. In this paper, we propose a novel method that glances at future source information through curriculum learning to achieve the transition from seq2seq training to prefix2prefix training. Specifically, we gradually reduce the available source information from the whole sentence to the prefix corresponding to the target latency. Our method is applicable to a wide range of SiMT methods, and experiments demonstrate that it outperforms strong baselines.
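
The transition described above, from full-sentence visibility down to the policy's prefix, can be illustrated with a minimal sketch. It uses the standard wait-$k$ read schedule $g_i = \min(k + i - 1, J)$ for a source of length $J$ and the widened window $\hat{g}_i = g_i + f_i$ from the TL;DR. The rule for $f_i$ (a fraction $\alpha$ of the unread source, rounded down) and the linear decay of $\alpha$ are assumptions made for illustration only; this page does not specify either, and the function names are hypothetical.

```python
import math


def wait_k_reads(k, src_len, tgt_len):
    """Source tokens read before each target token under wait-k.

    g_i = min(k + i - 1, src_len) for target positions i = 1..tgt_len.
    """
    return [min(k + i - 1, src_len) for i in range(1, tgt_len + 1)]


def glance_future(g, src_len, alpha):
    """Widened window g_hat_i = g_i + f_i.

    ASSUMPTION: f_i = floor(alpha * (src_len - g_i)), i.e. a fraction
    alpha of the source tokens the policy has not read yet.
    """
    return [gi + math.floor(alpha * (src_len - gi)) for gi in g]


def alpha_schedule(step, total_steps, alpha_start=1.0):
    """ASSUMPTION: linear decay of alpha from alpha_start to 0."""
    return max(0.0, alpha_start * (1.0 - step / total_steps))


def attention_mask(g_hat, src_len):
    """mask[i][j] is True when target token i may attend to source token j."""
    return [[j < gi for j in range(src_len)] for gi in g_hat]


if __name__ == "__main__":
    src_len, tgt_len, k = 6, 5, 2
    g = wait_k_reads(k, src_len, tgt_len)
    for step in (0, 50, 100):
        alpha = alpha_schedule(step, total_steps=100)
        print(step, alpha, glance_future(g, src_len, alpha))
```

Under these assumptions, $\alpha = 1$ reproduces seq2seq training (every target token sees the whole source) and $\alpha = 0$ reproduces plain prefix2prefix training under wait-$k$, so decaying $\alpha$ moves the model between the two regimes.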

Paper Structure (12 sections, 6 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Translation performance of the SiMT model under different wait-$k$ policies ($k_{test}\in\{1,3,5,7,9\}$) on the WMT15 De$\rightarrow$En dataset. A larger $k_{test}$ results in greater latency. The wait-$k$ policy (Ma et al., 2019) starts translating after reading $k$ source tokens. $k_{train}$ and $k_{test}$ denote the value of $k$ used for the wait-$k$ model during training and inference, respectively.
  • Figure 2: A German$\rightarrow$English example under different training frameworks. The SiMT model is trained to learn the wait-$2$ policy (Ma et al., 2019) using prefix2prefix training and glancing future training. $\alpha$ in our method is set to 0.75. The solid blue line represents the attention allowed by the policy. The red dashed line represents the extra attention allowed by our method.
  • Figure 3: Translation performance of SiMT methods on En$\rightarrow$Vi and De$\rightarrow$En.
  • Figure 4: Hallucination rate (HR) of different SiMT methods on the De$\rightarrow$En task.