Efficient and Adaptive Simultaneous Speech Translation with Fully Unidirectional Architecture

Biao Fu, Donglei Yu, Minpeng Liao, Chengxi Li, Yidong Chen, Kai Fan, Xiaodong Shi

TL;DR

This work tackles the challenge of efficient and accurate simultaneous speech translation by introducing EASiST, a fully unidirectional end-to-end framework that unifies streaming speech encoding with LLM-based translation. It features a data curation pipeline that creates multi-latency, semantically aligned interleaved speech-translation samples, a lightweight policy head for adaptive read/write decisions, and a three-stage training procedure that progressively aligns modalities and optimizes translation with policy learning. Empirical results on MuST-C En→De and En→Es demonstrate superior latency-quality trade-offs and robust inference efficiency, surpassing fixed-policy and some adaptive baselines while maintaining competitive offline translation performance. The approach offers practical impact for real-time translation systems by enabling cache-friendly, low-latency inference in an end-to-end architecture, with potential extensions to longer-form input and broader language directions.

Abstract

Simultaneous speech translation (SimulST) produces translations incrementally while processing partial speech input. Although large language models (LLMs) have shown strong capabilities in offline translation tasks, applying them to SimulST poses notable challenges. Existing LLM-based SimulST approaches either incur significant computational overhead due to repeated encoding by a bidirectional speech encoder, or depend on a fixed read/write policy, which limits both efficiency and performance. In this work, we introduce Efficient and Adaptive Simultaneous Speech Translation (EASiST), a fully unidirectional architecture in which both the speech encoder and the LLM are unidirectional. EASiST includes a multi-latency data curation strategy that generates semantically aligned SimulST training samples, and it redefines SimulST as an interleaved generation task with explicit read/write tokens. To facilitate adaptive inference, we incorporate a lightweight policy head that dynamically predicts read/write actions. Additionally, we employ a multi-stage training strategy to align the speech and text modalities and to optimize both translation and policy behavior. Experiments on the MuST-C En$\rightarrow$De and En$\rightarrow$Es datasets demonstrate that EASiST offers superior latency-quality trade-offs compared to several strong baselines.
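To make the idea of interleaved generation with an adaptive read/write policy concrete, here is a minimal toy sketch. It is not the EASiST implementation: the function `simulate_interleaved_decoding`, the callables `write_prob` and `next_token`, the action labels, and the threshold of 0.5 are all hypothetical names and values chosen for illustration, and strings stand in for speech chunks and target tokens.

```python
from typing import Callable, List, Sequence, Tuple

READ, WRITE = "READ", "WRITE"  # hypothetical action labels, not the paper's tokens


def simulate_interleaved_decoding(
    speech_chunks: Sequence[str],
    write_prob: Callable[[List[str], List[str]], float],
    next_token: Callable[[List[str], List[str]], str],
    threshold: float = 0.5,   # assumed decision threshold
    eos: str = "</s>",
    max_tokens: int = 50,
) -> List[Tuple[str, str]]:
    """Return the interleaved (action, item) trace for one utterance."""
    seen: List[str] = []      # speech chunks consumed so far
    target: List[str] = []    # translation tokens emitted so far
    trace: List[Tuple[str, str]] = []
    position = 0
    while len(target) < max_tokens:
        source_done = position == len(speech_chunks)
        # READ while input remains and the policy score is below the threshold.
        if not source_done and write_prob(seen, target) < threshold:
            chunk = speech_chunks[position]
            seen.append(chunk)
            position += 1
            trace.append((READ, chunk))
            continue
        # Otherwise WRITE one token conditioned on what has been seen so far.
        token = next_token(seen, target)
        if token == eos:
            break
        target.append(token)
        trace.append((WRITE, token))
    return trace


# Toy stand-ins for the policy head and the translation model.
def toy_write_prob(seen: List[str], target: List[str]) -> float:
    # Prefer writing only when some consumed input is still untranslated.
    return 1.0 if len(seen) > len(target) else 0.0


def toy_next_token(seen: List[str], target: List[str]) -> str:
    # "Translate" each chunk by upper-casing it; stop when nothing is left.
    return seen[len(target)].upper() if len(target) < len(seen) else "</s>"


trace = simulate_interleaved_decoding(["a", "b", "c"], toy_write_prob, toy_next_token)
```

With the toy policy above, the trace alternates one read with one write. Replacing `toy_write_prob` with a function that always returns 0.0 makes the loop read the whole input before writing anything, which corresponds to the offline end of the latency range.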

Paper Structure

This paper contains 23 sections, 11 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Overview of the proposed EASiST framework. Bottom: SimulST data curation pipeline that generates monotonic interleaved SimulST data from offline ST corpora. Top: A three-stage training strategy—(I) MT pre-training on SimulMT and offline MT data, (II) speech-text modality alignment via offline ST task, and (III) multi-task SFT for optimizing SimulST and adaptive read/write policy.
  • Figure 2: Translation quality (BLEU) against latency (LAAL) on the tst-COMMON set of the MuST-C En$\rightarrow$De and En$\rightarrow$Es datasets.
  • Figure 3: Translation quality (BLEU) against computation-aware latency (LAAL-CA) on the tst-COMMON set of the MuST-C En$\rightarrow$De/Es datasets.
  • Figure 4: Ablation study of our approach on the tst-COMMON set of the MuST-C En$\rightarrow$De dataset.
  • Figure 5: Training time for different ablation settings.
  • ...and 3 more figures