Non-Monotonic Attention-based Read/Write Policy Learning for Simultaneous Translation

Zeeshan Ahmed, Frank Seide, Zhe Liu, Rastislav Rabatin, Jachym Kolar, Niko Moritz, Ruiming Xie, Simone Merello, Christian Fuegen

TL;DR

The paper tackles the latency–quality trade-off in simultaneous translation by converting a high-quality non-streaming model into a streaming system. It introduces AliBaStr-MT, which freezes a pretrained full-sentence model and trains a lightweight read/write policy on alignment-derived pseudo-labels, with a cumulative attention threshold guiding the read/write decisions. A streaming beam search decoder and a tunable inference-time calibration threshold ($\delta$) give flexible latency control while keeping quality, measured by $BLEU$, close to the non-streaming upper bound; latency is measured by Average Lag ($AL$). Compared with fixed-policy and other monotonic-attention approaches, AliBaStr-MT is more efficient and scales better at inference time, and it performs robustly on real-life conversations and the FLEURS dataset. The result is a principled, alignment-driven, and tunable streaming solution for real-time translation systems.

Abstract

Simultaneous or streaming machine translation generates translation while reading the input stream. These systems face a quality/latency trade-off, aiming to achieve high translation quality similar to non-streaming models with minimal latency. We propose an approach that efficiently manages this trade-off. By enhancing a pretrained non-streaming model, which was trained with a seq2seq mechanism and represents the upper bound in quality, we convert it into a streaming model by utilizing the alignment between source and target tokens. This alignment is used to learn a read/write decision boundary for reliable translation generation with minimal input. During training, the model learns the decision boundary through a read/write policy module, employing supervised learning on the alignment points (pseudo labels). The read/write policy module, a small binary classification unit, can control the quality/latency trade-off during inference. Experimental results show that our model outperforms several strong baselines and narrows the gap with the non-streaming baseline model.
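
As an illustration of how alignment-derived pseudo-labels might be obtained, the sketch below turns an attention weight matrix into a 0/1 read/write label matrix by thresholding the cumulative attention over source positions. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the threshold value 0.6, the toy attention matrix, and the monotone clamping step are all hypothetical and only stand in for the kind of conversion the abstract describes.

```python
from itertools import accumulate

def policy_labels(attn, threshold=0.6):
    """attn[t][j]: attention weight of target token t on source token j
    (each row is assumed to sum to 1). Returns a matrix of the same shape
    where 1 means WRITE is allowed after reading source tokens 0..j and
    0 means READ more. The threshold is an assumed hyperparameter."""
    labels = []
    for row in attn:
        # cumulative attention mass covered by the source prefix 0..j
        labels.append([1 if c >= threshold else 0 for c in accumulate(row)])
    return labels

def alignment_points(labels):
    """Index of the first WRITE label per target token.
    Assumes every row contains at least one 1."""
    return [row.index(1) for row in labels]

# Toy, non-monotonic attention: target token 1 depends on a later
# source token than target token 2 does.
attn = [
    [0.7, 0.2, 0.1],
    [0.1, 0.2, 0.7],
    [0.3, 0.5, 0.2],
]
labels = policy_labels(attn)          # [[1, 1, 1], [0, 0, 1], [0, 1, 1]]
points = alignment_points(labels)     # [0, 2, 1]
# Assumed extra step: a streaming decoder cannot "unread" input, so the
# usable boundary is the running maximum of the alignment points.
monotone = list(accumulate(points, max))  # [0, 2, 2]
print(labels, points, monotone)
```

A label matrix of this kind could then serve as the supervised target for a small binary classifier, which is the role the abstract assigns to the read/write policy module.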

Paper Structure

This paper contains 13 sections, 8 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Read/Write Policy learning using a pretrained non-streaming encoder-decoder model. (a) A pretrained non-streaming model used for generating pseudo labels for training the policy network (this model is not trained). (b) Streaming model with Monotonic Attention and Policy Network. (c) The policy network is learned in a supervised manner using the read/write probability scores generated by Policy Label Generator.
  • Figure 2: Conversion of Attention weight matrix to Policy label matrix for supervised training of policy network.
  • Figure 3: Conversion of Attention weight matrix to Policy label matrix for supervised training of policy network.
  • Figure 4: BLEU (En-Es, Es-En) and Average Lag trend against the Read/Write module calibration threshold ($\delta$) on Real-life Conversation and FLEURS evaluation sets.
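
The kind of latency trend that Figure 4 plots against $\delta$ can be illustrated with a small simulation. The sketch below is hedged throughout: it assumes the policy module outputs a write probability and that the decoder writes once that probability reaches $\delta$ (the paper's exact decision rule is not given in this section), and it computes latency with the standard Average Lagging definition, $AL = \frac{1}{\tau}\sum_{t=1}^{\tau}\left[g(t) - \frac{t-1}{\gamma}\right]$ with $\gamma = |y|/|x|$. The names `simulate_policy` and `average_lagging` and the probability table are hypothetical.

```python
def simulate_policy(write_prob, delta):
    """write_prob[t][j]: hypothetical policy probability that target token t
    can be written after reading j+1 source tokens. Returns g, where g[t]
    is the number of source tokens read when target token t is written."""
    num_src = len(write_prob[0])
    g, read = [], 1  # assume the decoder starts by reading one source token
    for probs in write_prob:
        # READ until the policy is confident enough or the source is exhausted
        while read < num_src and probs[read - 1] < delta:
            read += 1
        g.append(read)
    return g

def average_lagging(g, num_src):
    """Average Lagging over the target tokens up to and including the first
    one written after the full source has been read."""
    num_tgt = len(g)
    gamma = num_tgt / num_src
    tau = next((t for t, gt in enumerate(g, start=1) if gt == num_src), num_tgt)
    return sum(g[t - 1] - (t - 1) / gamma for t in range(1, tau + 1)) / tau

# Toy probabilities for 3 target tokens over 4 source tokens (made up).
write_prob = [
    [0.2, 0.7, 0.9, 1.0],
    [0.1, 0.3, 0.8, 1.0],
    [0.0, 0.1, 0.4, 0.95],
]
for delta in (0.5, 0.85):
    g = simulate_policy(write_prob, delta)  # [2, 3, 4] then [3, 4, 4]
    print(delta, g, round(average_lagging(g, 4), 3))
```

Under these assumptions a larger $\delta$ forces more reads before each write, so the lag grows; the paper's Figure 4 reports how BLEU and Average Lag actually move with $\delta$ on its evaluation sets.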