Conversational SimulMT: Efficient Simultaneous Translation with Large Language Models

Minghan Wang, Thuy-Trang Vu, Yuxia Wang, Ehsan Shareghi, Gholamreza Haffari

TL;DR

This work tackles the efficiency gap in LLM-based Simultaneous MT by introducing Conversational SimulMT, a multi-turn prompt framework that preserves translation history through appended, conversational prompts, enabling persistent KV-cache reuse. To train such prompts, the authors curate conversational SimulMT data from offline parallel corpora via an alignment-guided pipeline that builds monotonic dependency graphs, derives meta trajectories of READ/WRITE chunks, and augments them to cover diverse latency requirements. Empirical results on WMT15 De→En, IWSLT15 En→Vi, and MUST-C En→Zh show that conversational prompting yields higher translation quality than offline prompting at comparable latency and can approach or surpass specialized SimulMT models in speed, with robust generalization across LLM families. The approach also introduces a scalable data-curation and augmentation framework, and highlights practical considerations such as decoding speed metrics (WWT) and context utilization, offering a viable path toward real-time, high-quality LLM-driven simultaneous translation.
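The cache-reuse argument can be illustrated with a toy comparison of the two prompt layouts. This is a minimal sketch, not the authors' implementation: the marker tokens `<sys>`, `<sep>`, `<user>`, and `<assistant>`, the helper names, and the word-level "tokens" are all assumptions made for illustration. It counts how much of the previous prompt remains an unchanged prefix once a new source chunk arrives, since only an unchanged prefix can be served from a KV cache.

```python
def common_prefix_len(a, b):
    """Number of leading tokens shared by two token lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def offline_prompt(src_chunks, tgt_chunks):
    # Hypothetical offline layout: all source so far, a separator, all target so far.
    src = [t for chunk in src_chunks for t in chunk]
    tgt = [t for chunk in tgt_chunks for t in chunk]
    return ["<sys>"] + src + ["<sep>"] + tgt


def conversational_prompt(src_chunks, tgt_chunks):
    # Hypothetical conversational layout: alternating user (READ) and
    # assistant (WRITE) turns, always appended at the end.
    seq = ["<sys>"]
    for i, chunk in enumerate(src_chunks):
        seq += ["<user>"] + chunk
        if i < len(tgt_chunks):
            seq += ["<assistant>"] + tgt_chunks[i]
    return seq


src = [["Guten", "Morgen"], ["liebe", "Freunde"]]
tgt = [["Good", "morning"]]

# State after the first READ/WRITE round, then after the second READ.
off_prev, off_next = offline_prompt(src[:1], tgt), offline_prompt(src[:2], tgt)
conv_prev, conv_next = conversational_prompt(src[:1], tgt), conversational_prompt(src[:2], tgt)

offline_reuse = common_prefix_len(off_prev, off_next)
conv_reuse = common_prefix_len(conv_prev, conv_next)
print(offline_reuse, len(off_prev))   # 3 6: new source is inserted before the old target
print(conv_reuse, len(conv_prev))     # 7 7: the whole previous prompt is still a prefix
```

In the offline layout the new source tokens land in the middle of the sequence, so everything after the insertion point would need to be recomputed; in the conversational layout the previous prompt is reused in full.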

Abstract

Simultaneous machine translation (SimulMT) presents a challenging trade-off between translation quality and latency. Recent studies have shown that LLMs can achieve good performance in SimulMT tasks. However, this often comes at the expense of high inference cost and latency. In this paper, we propose a conversational SimulMT framework to enhance the inference efficiency of LLM-based SimulMT through multi-turn-dialogue-based decoding. Our experiments with Llama2-7b-chat on two SimulMT benchmarks demonstrate the superiority of LLM in translation quality while achieving comparable computational latency to specialized SimulMT models.

Paper Structure

This paper contains 43 sections, 7 figures, 2 tables, 1 algorithm.

Figures (7)

  • Figure 1: Comparison of offline prompt (left) and conversational prompt (right). Offline prompt inserts tokens mid-sequence, preventing KV-cache reuse (red X), while conversational prompt appends content sequentially, enabling efficient cache utilization (blue blocks).
  • Figure 2: Illustration of the data curation process. The first graph is obtained from fast_align and is then converted into a monotonic dependency graph by adding edges. The Meta Trajectory is derived by segmenting the monotonic dependency graph with minimal dependency (segments marked by the colored solid lines in step 3). Finally, Policy Generalization augments the segmented graph with merge operations (red dotted lines are removed) and shift operations (blue dotted lines are shifted). Chunks in the trajectories derived from the third and fourth graphs are highlighted with different colors.
  • Figure 3: Translation quality and latency results on three benchmarks. Results are presented in three groups with different colors: (i) Encoder-Decoder Transformer baselines (orange), (ii) Offline-Prompt LLMs (blue), and (iii) Conversation-Prompt LLMs (red). Offline and Simultaneous decoding are distinguished by the first letter (O/S).
  • Figure 4: Relationship between computational efficiency (Word Wall Time) and translation quality (COMET score) on WMT15 De→En. Simultaneous decoding settings are shown as circles, with circle size representing variance across different latency control parameters (e.g. $n$). Offline settings are represented by diamonds. Color coding matches Figure 3, with our proposed approach highlighted in bold.
  • Figure 5: Effect of trajectory augmentation strategies on translation quality (BLEU) and latency (AL) for WMT15 De→En. Results compare models trained on meta-trajectories alone versus with merge and shift operations.
  • ...and 2 more figures
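
The data curation steps described in the Figure 2 caption can be sketched roughly as follows. This is one plausible reading of the caption, not the authors' algorithm: the function names `meta_trajectory` and `merge`, the running-maximum rule used to make dependencies monotonic, and the handling of unaligned or leftover tokens are all assumptions. The alignment is taken as a list of `(source_index, target_index)` pairs, such as those that can be parsed from fast_align output. The shift operation is not covered here.

```python
def meta_trajectory(src, tgt, alignment):
    """Split a sentence pair into (READ chunk, WRITE chunk) pairs.

    Assumption: a target token may be written once every source token up to
    its furthest aligned source position has been read.
    """
    # need[j]: furthest source index aligned to target token j (-1 if unaligned).
    need = [-1] * len(tgt)
    for i, j in alignment:
        need[j] = max(need[j], i)

    # A running maximum makes the dependency monotonic: later target tokens
    # never require less source context than earlier ones.
    running = 0
    for j in range(len(tgt)):
        running = max(running, need[j])
        need[j] = running

    chunks, read_upto, j = [], 0, 0
    while j < len(tgt):
        k = j
        while k < len(tgt) and need[k] == need[j]:
            k += 1  # target tokens with the same requirement share one WRITE chunk
        new_read = need[j] + 1
        chunks.append((src[read_upto:new_read], tgt[j:k]))
        read_upto, j = new_read, k

    # Source tokens that no target token depends on go into the last READ chunk.
    if read_upto < len(src):
        if chunks:
            r, w = chunks[-1]
            chunks[-1] = (r + src[read_upto:], w)
        else:
            chunks.append((src[read_upto:], []))
    return chunks


def merge(chunks, i):
    """Remove the boundary between chunks i and i+1 (a higher-latency policy)."""
    r1, w1 = chunks[i]
    r2, w2 = chunks[i + 1]
    return chunks[:i] + [(r1 + r2, w1 + w2)] + chunks[i + 2:]


src = ["Ich", "habe", "einen", "Apfel", "gegessen"]
tgt = ["I", "ate", "an", "apple"]
alignment = [(0, 0), (4, 1), (2, 2), (3, 3)]  # hypothetical word alignment
reordered = meta_trajectory(src, tgt, alignment)

mono = meta_trajectory(["Guten", "Morgen", "Freunde"],
                       ["Good", "morning", "friends"],
                       [(0, 0), (1, 1), (2, 2)])
merged = merge(mono, 0)
```

In the first example the verb-final German order forces a long READ before "ate" can be written, so the trajectory has only two chunks; in the second, a fully monotonic alignment yields one-word chunks, and `merge` then coarsens the first two into a single READ/WRITE step.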