SimulSense: Sense-Driven Interpreting for Efficient Simultaneous Speech Translation
Haotian Tan, Hiroki Ouchi, Sakriani Sakti
TL;DR
SimulSense tackles the quality-latency challenge in Simultaneous Speech Translation by replacing costly LLM-based decision policies with a lightweight Sense Units Detector that triggers translation when a semantically meaningful sense unit is detected. The system jointly trains a Sense-Aware Transducer with CIF-based segmentation to produce unit-level features aligned to target tokens, and relies on an offline speech translation model for translation, with a trigger rule based on accumulated weights and a threshold. Training optimizes a joint ASR loss plus alignment and language-model losses to encourage one-to-one mapping between sense units and tokens. Experiments across En→De/Ja/Zh show superior quality-latency trade-offs and major efficiency gains over strong baselines, demonstrating that high-quality simultaneous translation can be achieved with a real-time, LLM-free decision policy.
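The trigger rule mentioned above, accumulated weights compared against a threshold, can be illustrated with a minimal sketch of the general continuous integrate-and-fire (CIF) mechanism. This is not the authors' implementation: the function name `fire_indices`, the default threshold of 1.0, and the assumption that each per-frame weight lies in [0, 1] are illustrative choices, and the exact rule used by SimulSense may differ.

```python
from typing import Iterable, List


def fire_indices(weights: Iterable[float], threshold: float = 1.0) -> List[int]:
    """Return the frame indices at which a write decision would be triggered.

    Hypothetical helper: assumes each weight is in [0, 1], so at most one
    unit boundary can be crossed per frame.
    """
    fired: List[int] = []
    accumulated = 0.0
    for i, w in enumerate(weights):
        accumulated += w  # integrate the weight of the current frame
        if accumulated >= threshold:
            fired.append(i)  # a unit boundary is detected at frame i
            accumulated -= threshold  # carry the remainder into the next unit
    return fired


# Frames 1 and 4 are where the running sum reaches the threshold.
boundaries = fire_indices([0.5, 0.5, 0.25, 0.25, 0.5])
```

In a streaming setting, each fired index would be the point at which the accumulated speech is handed to the translation model, while frames between boundaries only extend the read buffer.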
Abstract
How can simultaneous speech translation (SimulST) systems make read/write decisions the way human interpreters do? Current state-of-the-art systems formulate SimulST as a multi-turn dialogue task, requiring specialized interleaved training data and relying on computationally expensive large language model (LLM) inference for decision-making. In this paper, we propose SimulSense, a novel framework for SimulST that mimics human interpreters by continuously reading input speech and triggering write decisions to produce translation when a new sense unit is perceived. Experiments against two state-of-the-art baseline systems demonstrate that our proposed method achieves a superior quality-latency trade-off and substantially improved real-time efficiency, with decision-making up to 9.6x faster than the baselines.
