R-BI: Regularized Batched Inputs enhance Incremental Decoding Framework for Low-Latency Simultaneous Speech Translation

Jiaxin Guo, Zhanglin Wu, Zongyao Li, Hengchao Shang, Daimeng Wei, Xiaoyu Chen, Zhiqiang Rao, Shaojun Li, Hao Yang

TL;DR

This work introduces Regularized Batched Inputs (R-BI), a flexible policy that enhances incremental decoding for low-latency Simultaneous Speech Translation by diversifying the inputs at each decoding step and committing only to stable prefixes. Regularization is applied to the speech input in end-to-end systems and to the ASR-generated text in cascaded systems, which reduces output errors when the input is still incomplete. R-BI achieves low latency while losing at most 2 BLEU points relative to offline systems, and it attains state-of-the-art results on several IWSLT SimulST directions. The end-to-end variant uses audio augmentations (time stretching, time shifting, volume changes, noise, and masking); the cascaded variant uses a U2-style hybrid ASR model with multiple decoding modes, so the policy applies to both system types. Experiments on MuST-C and related ASR/MT datasets show improvements over prior policies (Hold-$n$, LA-$n$) and favorable comparisons to OfflineST baselines, while revealing limitations related to ASR accuracy and real-world translation discontinuities. Overall, R-BI provides a general framework for turning OfflineST models into efficient SimulST deployments with controlled latency and competitive translation quality.
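To make the end-to-end regularization concrete, here is a minimal sketch of how one partial audio chunk might be expanded into a batch of perturbed copies. It is an illustration under stated assumptions, not the authors' implementation: the function names, the parameter ranges, the choice to keep the unmodified chunk as the first batch item, and the use of plain Python lists for samples are all hypothetical. Time stretching is left out to keep the sketch dependency-free.

```python
import random
from typing import List

Waveform = List[float]  # assumption: mono audio samples as plain floats


def scale_volume(wave: Waveform, gain: float) -> Waveform:
    """Multiply every sample by a constant gain."""
    return [s * gain for s in wave]


def add_noise(wave: Waveform, std: float, rng: random.Random) -> Waveform:
    """Add zero-mean Gaussian noise to every sample."""
    return [s + rng.gauss(0.0, std) for s in wave]


def shift_time(wave: Waveform, offset: int) -> Waveform:
    """Delay the signal by `offset` samples, padding with silence.

    The output keeps the input length, so trailing samples are dropped.
    """
    if offset <= 0:
        return list(wave)
    return ([0.0] * offset + list(wave))[: len(wave)]


def mask_span(wave: Waveform, start: int, width: int) -> Waveform:
    """Zero out a contiguous span of samples."""
    out = list(wave)
    for i in range(max(start, 0), min(start + width, len(out))):
        out[i] = 0.0
    return out


def build_batch(wave: Waveform, batch_size: int, seed: int = 0) -> List[Waveform]:
    """Return the original chunk plus (batch_size - 1) perturbed copies.

    The ranges below are illustrative values, not taken from the paper.
    """
    rng = random.Random(seed)
    perturbations = [
        lambda w: scale_volume(w, rng.uniform(0.8, 1.2)),
        lambda w: add_noise(w, 0.01, rng),
        lambda w: shift_time(w, rng.randint(1, 3)),
        lambda w: mask_span(w, rng.randrange(max(len(w), 1)), 2),
    ]
    batch = [list(wave)]
    for i in range(batch_size - 1):
        batch.append(perturbations[i % len(perturbations)](wave))
    return batch
```

Each item in the batch would then be decoded by the same offline model, giving several hypotheses for the same partial input.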

Abstract

Incremental Decoding is an effective framework that enables the use of an offline model in a simultaneous setting without modifying the original model, making it suitable for Low-Latency Simultaneous Speech Translation. However, this framework may introduce errors when the system produces output from incomplete input. To reduce these output errors, several strategies such as Hold-$n$, LA-$n$, and SP-$n$ can be employed, but the hyper-parameter $n$ needs to be carefully selected for optimal performance. Moreover, these strategies are more suitable for end-to-end systems than for cascade systems. In this paper, we propose a new adaptable and efficient policy named "Regularized Batched Inputs". Our method stands out by enhancing input diversity to mitigate output errors. We propose specific regularization techniques for both end-to-end and cascade systems. We conducted experiments on IWSLT Simultaneous Speech Translation (SimulST) tasks, which demonstrate that our approach achieves low latency while losing no more than 2 BLEU points compared to offline systems. Furthermore, our SimulST systems attained several new state-of-the-art results in various language directions.
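One way to read "enhancing input diversity to mitigate output errors" is that tokens on which all hypotheses from the regularized batch agree are treated as stable and emitted, while the rest are withheld until more input arrives. The sketch below illustrates that reading. It assumes the stable prefix is the longest common token prefix across the batch, which the text above does not spell out; `longest_common_prefix` and `commit_step` are hypothetical names.

```python
from typing import List, Sequence


def longest_common_prefix(hypotheses: Sequence[Sequence[str]]) -> List[str]:
    """Return the tokens shared by every hypothesis, position by position."""
    if not hypotheses:
        return []
    prefix: List[str] = []
    # zip stops at the shortest hypothesis, so the prefix cannot exceed it.
    for tokens in zip(*hypotheses):
        if all(t == tokens[0] for t in tokens):
            prefix.append(tokens[0])
        else:
            break
    return prefix


def commit_step(committed: List[str],
                hypotheses: Sequence[Sequence[str]]) -> List[str]:
    """Extend the committed output with newly stable tokens.

    Output that was already emitted is never retracted: if the agreed
    prefix is shorter than, or inconsistent with, the committed tokens,
    the committed tokens are returned unchanged.
    """
    stable = longest_common_prefix(hypotheses)
    if len(stable) > len(committed) and stable[: len(committed)] == committed:
        return stable
    return committed


# Three hypotheses decoded from perturbed copies of the same partial input.
hyps = [
    ["the", "cat", "sat", "on"],
    ["the", "cat", "sat", "down"],
    ["the", "cat", "sat"],
]
print(commit_step(["the"], hyps))  # ['the', 'cat', 'sat']
```

Unlike Hold-$n$ or LA-$n$, this rule has no $n$ to tune; the number of perturbed inputs per step plays the role of the main knob in this sketch.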

Paper Structure (33 sections, 3 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Incremental Decoding Framework
  • Figure 2: The difference between Hold-$n$ (Liu et al., 2020), LA-$n$ (Liu et al., 2020), and our proposed R-BI: (a) Hold-1, (b) LA-2, (c) R-BI with batch size 2.
  • Figure 3: End-to-End and Cascaded Systems of R-BI.
  • Figure 4: Different waveforms under different Regularization Methods for End-to-End System
  • Figure 5: U2 Model: hybrid ASR structure combines a CTC decoder with an AED decoder
  • ...and 1 more figure