ExPosST: Explicit Positioning with Adaptive Masking for LLM-Based Simultaneous Machine Translation

Yuzhe Shang, Pengzhi Gao, Yazheng Yang, Jiayao Ma, Wei Liu, Jian Luan, Jingsong Su

Abstract

Large language models (LLMs) have recently demonstrated promising performance in simultaneous machine translation (SimulMT). However, applying decoder-only LLMs to SimulMT introduces a positional mismatch, which leads to a dilemma between decoding efficiency and positional consistency. Existing approaches often rely on specific positional encodings or carefully designed prompting schemes, and thus fail to simultaneously achieve inference efficiency, positional consistency, and broad model compatibility. In this work, we propose ExPosST, a general framework that resolves this dilemma through explicit position allocation. ExPosST reserves fixed positional slots for incoming source tokens, enabling efficient decoding with KV cache across different positional encoding methods. To further bridge the gap between fine-tuning and inference, we introduce a policy-consistent fine-tuning strategy that aligns training with inference-time decoding behavior. Experiments across multiple language pairs demonstrate that ExPosST effectively supports simultaneous translation under diverse policies.

Paper Structure

This paper contains 18 sections, 1 equation, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Comparison of different strategies for LLM-based SimulMT. $\langle u \rangle$ and $\langle a \rangle$ denote the user and assistant role prompts in the LLM conversational template. (a) Cache Reuse: Inserting a new source token ($s_3$) shifts the positional indices of subsequent tokens (e.g., $t_1$ shifts from 3 to 4), causing a mismatch with the existing KV cache. (b) Cache Recomputation: Positional consistency is restored by re-encoding the shifted tokens, but this incurs prohibitive computational latency. (c) ExPosST (Ours): By explicitly pre-allocating positional slots for potential source tokens, the positional indices of the target sequence remain invariant during READ/WRITE cycles, enabling efficient KV cache reuse without positional misalignment.
  • Figure 2: Overview of the ExPosST framework. Left: The Inference with Pre-allocated Positions strategy. We adopt a wait-2 policy and set the pre-allocated slot length $L_{slot}=3$. When $s_3$ is read, it fills the reserved slot without shifting the positions of generated target tokens (e.g., $t_1$). Upon reading $s_4$, a new source slot is allocated immediately after the current output to maintain positional consistency. Right: The Policy-Consistent Fine-tuning strategy. The source sentence is segmented into parts to match the inference slot. A policy-consistent attention mask (bottom right) is applied to ensure the visibility of source tokens aligns with the specific simultaneous policy.
  • Figure 3: Sensitivity analysis of the pre-allocated slot length $L_{slot}$ on the IWSLT 2017 En-De dev set, illustrating its impact on translation performance (BLEU score) and average training sequence length for the Llama-3.1-8B-Instruct model.
  • Figure 4: Main results on IWSLT 2017 tasks on Llama-3.1-8B-Instruct. The figures illustrate the BLEU-LAAL trade-off curves, comparing ExPosST with various baselines across two mainstream LLM architectures. The dashed horizontal lines indicate the performance of the corresponding offline models. Higher curves and those shifted toward the top-left represent a superior quality-latency balance.
  • Figure 5: BLEU-LAAL trade-off on IWSLT 2017 datasets on Qwen2.5-7B-Instruct.
  • ...and 4 more figures
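
The positional behaviour described in the captions of Figures 1 and 2 can be illustrated with a small toy sketch. This is not the authors' implementation: the names `naive_positions` and `SlotAllocator`, the 1-based indices, the omission of the role prompts, and the rule that target tokens are always placed after the most recently reserved slot are assumptions made for illustration only. The sketch contrasts contiguous indexing, where reading a new source token shifts the target positions, with pre-allocated slots, where target positions stay fixed.

```python
from dataclasses import dataclass, field


def naive_positions(source, target):
    """Contiguous 1-based positions: all source tokens, then all target tokens."""
    seq = list(source) + list(target)
    return {tok: i for i, tok in enumerate(seq, start=1)}


@dataclass
class SlotAllocator:
    """Toy allocator (hypothetical): source tokens fill reserved slots,
    target tokens are placed after everything reserved so far."""
    slot_len: int
    positions: dict = field(default_factory=dict)  # token name -> position
    _slot_next: int = 1   # next free position inside the current source slot
    _slot_end: int = 0    # last position reserved for the current source slot
    _frontier: int = 0    # highest position reserved or used so far

    def __post_init__(self):
        self._open_slot()

    def _open_slot(self):
        # Reserve slot_len positions right after everything allocated so far.
        self._slot_next = self._frontier + 1
        self._slot_end = self._frontier + self.slot_len
        self._frontier = self._slot_end

    def read(self, token):
        # READ: place a source token in the current slot, opening a new
        # slot after the current output if the old one is full.
        if self._slot_next > self._slot_end:
            self._open_slot()
        self.positions[token] = self._slot_next
        self._slot_next += 1

    def write(self, token):
        # WRITE: place a target token at the next unused position.
        self._frontier += 1
        self.positions[token] = self._frontier


# Contiguous indexing: reading s3 moves t1 from position 3 to position 4.
print(naive_positions(["s1", "s2"], ["t1"])["t1"])
print(naive_positions(["s1", "s2", "s3"], ["t1"])["t1"])

# Pre-allocated slots with a wait-2 schedule and slot length 3.
alloc = SlotAllocator(slot_len=3)
for action, token in [("read", "s1"), ("read", "s2"), ("write", "t1"),
                      ("read", "s3"), ("write", "t2"),
                      ("read", "s4"), ("write", "t3")]:
    getattr(alloc, action)(token)
print(alloc.positions)
```

In this toy schedule `t1` keeps the same position before and after `s3` is read, so cached keys and values for `t1` would not need to be recomputed. How the real framework orders slots and targets beyond what the captions state is not specified here, so the later positions in the printout should be read as one plausible layout only.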