Conversational SimulMT: Efficient Simultaneous Translation with Large Language Models
Minghan Wang, Thuy-Trang Vu, Yuxia Wang, Ehsan Shareghi, Gholamreza Haffari
TL;DR
This work tackles the efficiency gap in LLM-based simultaneous machine translation (SimulMT) by introducing Conversational SimulMT, a multi-turn prompt framework that preserves the translation history through appended conversational prompts, enabling persistent KV-cache reuse. To train models on such prompts, the authors curate conversational SimulMT data from offline parallel corpora via a scalable alignment-guided pipeline that builds monotonic dependency graphs, derives meta trajectories of READ/WRITE chunks, and augments them to cover diverse latency requirements. Empirical results on WMT15 De→En, IWSLT15 En→Vi, and MUST-C En→Zh show that conversational prompting yields higher translation quality than offline prompting at comparable latency, can approach or surpass specialized SimulMT models in speed, and generalizes robustly across LLM families. The work also highlights practical considerations such as decoding speed metrics (WWT) and context utilization, offering a viable path toward real-time, high-quality LLM-driven simultaneous translation.
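To make the idea of READ/WRITE chunks concrete, the following is a minimal sketch, not the authors' pipeline. It assumes word alignments are already available as (source index, target index) pairs, and it swaps the dependency-graph construction for a simpler rule: each target word is written as soon as every source word it aligns to has been read. The function names `read_write_chunks` and `merge_chunks`, and merging adjacent chunks as a way to get higher-latency variants, are illustrative assumptions.

```python
from typing import List, Set, Tuple

Chunk = Tuple[List[str], List[str]]  # (source words to READ, target words to WRITE)


def read_write_chunks(src: List[str], tgt: List[str],
                      links: Set[Tuple[int, int]]) -> List[Chunk]:
    """Split a sentence pair into monotonic READ/WRITE chunks.

    Simplified rule (an assumption, not the paper's algorithm): a target
    word can be written once all source words aligned to it are read.
    """
    chunks: List[Chunk] = []
    read = 0  # number of source words consumed so far
    for j, word in enumerate(tgt):
        # Source positions (exclusive end) required before writing tgt[j].
        needed = max((i + 1 for i, jj in links if jj == j), default=0)
        if needed > read or not chunks:
            # More source is required: open a new chunk with a fresh READ.
            end = max(needed, read)
            chunks.append((src[read:end], [word]))
            read = end
        else:
            # Everything needed is already read: extend the current WRITE.
            chunks[-1][1].append(word)
    if read < len(src):
        # Leftover source words with no aligned target: a final READ-only chunk.
        chunks.append((src[read:], []))
    return chunks


def merge_chunks(chunks: List[Chunk], k: int) -> List[Chunk]:
    """Merge every k consecutive chunks to mimic a higher-latency trajectory."""
    merged: List[Chunk] = []
    for start in range(0, len(chunks), k):
        group = chunks[start:start + k]
        merged.append(([w for s, _ in group for w in s],
                       [w for _, t in group for w in t]))
    return merged


src = ["Ich", "habe", "einen", "Hund", "gesehen"]
tgt = ["I", "saw", "a", "dog"]
links = {(0, 0), (1, 1), (4, 1), (2, 2), (3, 3)}
chunks = read_write_chunks(src, tgt, links)
```

In this example the verb "saw" depends on the sentence-final "gesehen", so the second chunk has to read the rest of the source before writing anything further. Calling `merge_chunks(chunks, 2)` collapses the trajectory into a single chunk, which corresponds to offline translation.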
Abstract
Simultaneous machine translation (SimulMT) presents a challenging trade-off between translation quality and latency. Recent studies have shown that LLMs can achieve good performance in SimulMT tasks. However, this often comes at the expense of high inference cost and latency. In this paper, we propose a conversational SimulMT framework to enhance the inference efficiency of LLM-based SimulMT through multi-turn-dialogue-based decoding. Our experiments with Llama2-7b-chat on two SimulMT benchmarks demonstrate the superiority of LLMs in translation quality, while achieving computational latency comparable to that of specialized SimulMT models.
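The efficiency argument behind multi-turn decoding can be illustrated with a small sketch. The prompt templates below are hypothetical, not the ones used in the paper, and the "offline-style" layout is an assumed contrast case. The sketch only shows the prefix property: when each step is appended as a new dialogue turn, the earlier prompt is an exact prefix of the later one, which is the condition under which a KV-cache computed for earlier tokens remains valid.

```python
from typing import List, Tuple

Turn = Tuple[str, str]  # (source chunk read, target chunk written)


def conversational_prompt(turns: List[Turn]) -> str:
    """Render turns as an append-only dialogue (hypothetical template)."""
    return "".join(f"[USER] {src}\n[ASSISTANT] {tgt}\n" for src, tgt in turns)


def offline_style_prompt(turns: List[Turn]) -> str:
    """Concatenate all source, then all target (assumed contrast layout)."""
    source = " ".join(src for src, _ in turns)
    target = " ".join(tgt for _, tgt in turns)
    return f"Source: {source}\nTarget: {target}"


turns = [("Ich habe", "I have"), ("einen Hund", "a dog")]

# Prompts after the first and second translation step.
conv_step1 = conversational_prompt(turns[:1])
conv_step2 = conversational_prompt(turns)
off_step1 = offline_style_prompt(turns[:1])
off_step2 = offline_style_prompt(turns)

# Appending turns keeps the old prompt as a prefix, so cached states stay valid.
conv_reusable = conv_step2.startswith(conv_step1)
# Growing the source in the middle of the prompt changes earlier positions.
off_reusable = off_step2.startswith(off_step1)
```

Here `conv_reusable` is `True` and `off_reusable` is `False`: in the offline-style layout, new source words are inserted before the existing target prefix, so the tokens after the insertion point would need to be re-encoded at every step.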
