Non-Monotonic Attention-based Read/Write Policy Learning for Simultaneous Translation
Zeeshan Ahmed, Frank Seide, Zhe Liu, Rastislav Rabatin, Jachym Kolar, Niko Moritz, Ruiming Xie, Simone Merello, Christian Fuegen
TL;DR
The paper tackles the latency–quality trade-off in simultaneous translation by converting a high-quality non-streaming model into a streaming system. It introduces AliBaStr-MT, which freezes a pretrained full-sentence model and trains a lightweight read/write policy using alignment-derived pseudo-labels, with a cumulative attention threshold guiding decisions. A streaming beam search decoder and a tunable inference-time calibration parameter ($\delta$) enable flexible latency control while keeping translation quality, measured by BLEU, close to the non-streaming upper bound; latency is measured by $AL$. Compared with fixed-policy and monotonic-attention approaches, AliBaStr-MT delivers improved efficiency and scales better during inference, demonstrating robust performance on real-life conversations and the FLEURS dataset. The result is a principled, alignment-driven, and tunable streaming solution for real-time translation systems.
Abstract
Simultaneous or streaming machine translation generates the translation while still reading the input stream. Such systems face a quality/latency trade-off: they aim for translation quality close to that of non-streaming models at minimal latency. We propose an approach that efficiently manages this trade-off. We start from a pretrained non-streaming model, which was trained with a seq2seq mechanism and represents the upper bound in quality, and convert it into a streaming model by utilizing the alignment between source and target tokens. This alignment is used to learn a read/write decision boundary, so that translation can be generated reliably from minimal input. During training, the model learns the decision boundary through a read/write policy module, employing supervised learning on the alignment points (pseudo labels). The read/write policy module, a small binary classification unit, can control the quality/latency trade-off during inference. Experimental results show that our model outperforms several strong baselines and narrows the gap with the non-streaming baseline model.
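As a rough illustration of how alignment points could become read/write pseudo labels, the following minimal Python sketch assumes a per-target-token attention matrix over source tokens. The function names (`alignment_points`, `pseudo_labels`, `decide`), the cumulative-mass rule, and the use of a cutoff `delta` on the policy's write probability are illustrative assumptions, not the paper's exact formulation.

```python
from typing import List


def alignment_points(attn: List[List[float]], threshold: float = 0.5) -> List[int]:
    """For each target token (one row of `attn`), return the smallest number of
    source tokens whose cumulative attention mass reaches `threshold` of the
    row total. Assumption: this stands in for an alignment-derived read/write
    boundary; the points are not forced to be monotonic across rows."""
    points = []
    for row in attn:
        total = sum(row)
        cumulative = 0.0
        point = len(row)  # fall back to reading the full source
        for j, weight in enumerate(row):
            cumulative += weight
            if cumulative >= threshold * total:
                point = j + 1
                break
        points.append(point)
    return points


def pseudo_labels(points: List[int], src_len: int) -> List[List[int]]:
    """labels[i][j] is 1 (write) if target token i may be emitted after
    reading j + 1 source tokens, and 0 (read) otherwise."""
    return [[1 if j + 1 >= p else 0 for j in range(src_len)] for p in points]


def decide(write_prob: float, delta: float = 0.5) -> str:
    """Hypothetical inference-time rule: write when the policy's predicted
    write probability reaches the cutoff `delta`, otherwise read more input."""
    return "WRITE" if write_prob >= delta else "READ"
```

Under this sketch, raising `delta` makes the policy wait for more source context before writing (higher latency, typically higher quality), while lowering it does the opposite. A binary classifier trained on the output of `pseudo_labels` would supply `write_prob`.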
