Glancing Future for Simultaneous Machine Translation
Shoutao Guo, Shaolei Zhang, Yang Feng
TL;DR
This work addresses the gap between seq2seq and prefix2prefix training in simultaneous machine translation by introducing glancing future training, which follows a curriculum that starts with target tokens fully exposed to future source information and gradually withdraws that exposure. The method defines an adjustable future-attention window $\hat{g}_i = g_i + f_i$, where $g_i$ is the number of source tokens the policy makes available for target token $i$ and $f_i$ is the number of additional future tokens exposed, together with a decaying exposure parameter $\alpha$, enabling a smooth transition from full-sentence to low-latency training. Integrated with both fixed (Wait-$k$) and adaptive (HMT) SiMT policies, the approach improves BLEU and reduces hallucinations on En→Vi and De→En tasks. The findings offer practical guidance on selecting adjacent future information and demonstrate that bridging the training gap can enhance the use of global information without destabilizing latency. The work thus provides a versatile, training-time mechanism to strengthen SiMT models in streaming scenarios.
Abstract
Simultaneous machine translation (SiMT) outputs the translation while still reading the source sentence. Unlike conventional sequence-to-sequence (seq2seq) training, existing SiMT methods adopt prefix-to-prefix (prefix2prefix) training, where the model predicts target tokens based on partial source tokens. However, prefix2prefix training diminishes the model's ability to capture global information and introduces forced predictions, because essential source information is absent. Consequently, it is crucial to bridge the gap between prefix2prefix training and seq2seq training to enhance the translation capability of the SiMT model. In this paper, we propose a novel method that glances at future source information under a curriculum-learning schedule to achieve the transition from seq2seq training to prefix2prefix training. Specifically, we gradually reduce the available source information from the whole sentence to the prefix corresponding to the target latency. Our method is applicable to a wide range of SiMT methods, and experiments demonstrate that it outperforms strong baselines.
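To make the curriculum concrete, the following is a minimal sketch of how the window $\hat{g}_i = g_i + f_i$ could shrink from the whole sentence to a Wait-$k$ prefix as $\alpha$ decays. The Wait-$k$ prefix length $g_i = \min(k + i - 1, J)$ is the standard definition; the linear decay of $\alpha$ and the choice $f_i = \lfloor \alpha (J - g_i) \rfloor$ are assumptions made here for illustration and may differ from the exact formulation used in the paper. All function names are hypothetical.

```python
import math


def wait_k_prefix(i, k, src_len):
    """Source tokens visible for target token i (1-indexed) under Wait-k."""
    return min(k + i - 1, src_len)


def exposure(step, decay_steps):
    """ASSUMPTION: alpha decays linearly from 1 to 0 over decay_steps updates."""
    return max(0.0, 1.0 - step / decay_steps)


def glancing_prefix(g_i, src_len, alpha):
    """Return g_hat_i = g_i + f_i.

    ASSUMPTION: f_i = floor(alpha * (src_len - g_i)), i.e. a fraction alpha
    of the not-yet-read source tokens is exposed to the target token.
    """
    f_i = math.floor(alpha * (src_len - g_i))
    return g_i + f_i


def attention_mask(tgt_len, src_len, k, alpha):
    """Boolean cross-attention mask: mask[i][j] is True if target token i+1
    may attend to source token j+1 under the glancing window."""
    mask = []
    for i in range(1, tgt_len + 1):
        g_hat = glancing_prefix(wait_k_prefix(i, k, src_len), src_len, alpha)
        mask.append([j < g_hat for j in range(src_len)])
    return mask


if __name__ == "__main__":
    # Print the per-target-token window size at three points of the curriculum.
    for step in (0, 50, 100):
        alpha = exposure(step, decay_steps=100)
        rows = attention_mask(tgt_len=3, src_len=6, k=2, alpha=alpha)
        print(step, alpha, [sum(row) for row in rows])
```

Under these assumptions, $\alpha = 1$ exposes every source token to every target token (seq2seq training), while $\alpha = 0$ reduces the mask to the plain Wait-$k$ prefix (prefix2prefix training). For an adaptive policy such as HMT, `wait_k_prefix` would be replaced by whatever prefix length the policy produces.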
