SimulSense: Sense-Driven Interpreting for Efficient Simultaneous Speech Translation

Haotian Tan, Hiroki Ouchi, Sakriani Sakti

TL;DR

SimulSense tackles the quality-latency challenge in Simultaneous Speech Translation by replacing costly LLM-based decision policies with a lightweight Sense Units Detector that triggers translation when a semantically meaningful sense unit is detected. The system jointly trains a Sense-Aware Transducer with CIF-based segmentation to produce unit-level features aligned to target tokens, and relies on an offline speech translation model for translation, with a trigger rule based on accumulated weights and a threshold. Training optimizes a joint ASR loss plus alignment and language-model losses to encourage one-to-one mapping between sense units and tokens. Experiments across En→De/Ja/Zh show superior quality-latency trade-offs and major efficiency gains over strong baselines, demonstrating that high-quality simultaneous translation can be achieved with a real-time, LLM-free decision policy.

Abstract

How can simultaneous speech translation (SimulST) systems make human-interpreter-like read/write decisions? Current state-of-the-art systems formulate SimulST as a multi-turn dialogue task, requiring specialized interleaved training data and relying on computationally expensive large language model (LLM) inference for decision-making. In this paper, we propose SimulSense, a novel framework for SimulST that mimics human interpreters by continuously reading input speech and triggering write decisions to produce translation when a new sense unit is perceived. Experiments against two state-of-the-art baseline systems demonstrate that our proposed method achieves a superior quality-latency trade-off and substantially improved real-time efficiency, with decision-making up to 9.6x faster than the baselines.

Paper Structure

This paper contains 20 sections, 7 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: The Sense Units Detector (SUD) assigns a weight to each acoustic feature of the input stream. When the accumulated weight sum (including a prior residual weight $r$) exceeds a triggering threshold $\gamma$, a sense unit is detected, which triggers the offline ST model to produce translations.
  • Figure 2: Overview of the SimulSense framework. Left: A Sense-Aware Transducer (SAT) training pipeline that guides a lightweight Sense Units Detector (SUD) model in learning human-interpreter-like sense unit segmentation. Right: The SUD model dynamically perceives sense units from acoustic features and drives the LLM-based offline ST model to perform simultaneous translation.
  • Figure 3: Quality-latency trade-off (latency in ms) compared to two state-of-the-art SimulST systems on the En$\rightarrow$De, En$\rightarrow$Ja, and En$\rightarrow$Zh directions of the ACL 60/60 test set. SimulSense achieves a superior quality-latency trade-off across all language pairs.
  • Figure 4: The impact of using different latency tags (latency in ms).
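
The trigger rule described in the Figure 1 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `detect_sense_units`, the use of `>=` for the threshold comparison, and the choice to carry the overshoot forward as the next residual (as in standard CIF-style integrate-and-fire) are all assumptions made for illustration. Only the per-feature weights, the residual $r$, and the threshold $\gamma$ come from the caption.

```python
from typing import List, Sequence, Tuple


def detect_sense_units(
    weights: Sequence[float],
    gamma: float = 1.0,
    residual: float = 0.0,
) -> Tuple[List[int], float]:
    """Return the frame indices at which a sense unit fires, plus the leftover weight.

    weights  : per-frame weights assigned by a (hypothetical) detector
    gamma    : triggering threshold
    residual : weight carried over from the previous chunk (r in Figure 1)
    """
    boundaries: List[int] = []
    acc = residual  # start from the prior residual weight
    for i, w in enumerate(weights):
        acc += w
        # Assumption: fire when the accumulated weight reaches the threshold,
        # at most once per frame, and keep the overshoot as the new residual.
        if acc >= gamma:
            boundaries.append(i)
            acc -= gamma
    return boundaries, acc


if __name__ == "__main__":
    frames = [0.25, 0.5, 0.5, 0.25, 0.75]
    fired_at, leftover = detect_sense_units(frames, gamma=1.0)
    print(fired_at, leftover)
```

In a streaming setting, each fired index would be the point at which the offline ST model is asked to translate the speech read so far, and the returned leftover would be passed as `residual` when processing the next chunk of features.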