Simultaneous Speech-to-Speech Translation Without Aligned Data

Tom Labiausse; Romain Fabre; Yannick Estève; Alexandre Défossez; Neil Zeghidour

TL;DR

Hibiki-Zero addresses the challenge of simultaneous speech translation without word-level alignments by learning from sentence-level aligned data and optimizing the translation policy with reinforcement learning based on BLEU-derived rewards. The system employs a decoder-only, multistream architecture that tokenizes speech with a neural audio codec (Mimi) and jointly models multiple streams for S2ST and S2TT, including an inner-monologue text stream and voice-transfer capabilities. Coarse sentence-level alignments and natural pauses in TTS enable scalable data generation across languages; reinforcement learning with GRPO-based optimization yields strong quality-latency trade-offs and robust performance on multilingual tasks, including adaptation to Italian with under 1000 hours of data. The work demonstrates state-of-the-art results on long- and short-form data, offers new language adaptation capabilities, and provides model weights, inference code, and a multilingual 45-hour benchmark to advance scalable S2ST research.

Abstract

Simultaneous speech translation requires translating source speech into a target language in real time while handling non-monotonic word dependencies. Traditional approaches rely on supervised training with word-level aligned data, which is difficult to collect at scale; in practice they depend on synthetic alignments produced by language-specific heuristics that are suboptimal. We propose Hibiki-Zero, which eliminates the need for word-level alignments entirely. This fundamentally simplifies the training pipeline and enables seamless scaling to diverse languages with varying grammatical structures, removing the bottleneck of designing language-specific alignment heuristics. We first train on sentence-level aligned data to learn speech translation at high latency, then apply a novel reinforcement learning strategy using GRPO to reduce latency while preserving translation quality. Hibiki-Zero achieves state-of-the-art performance in translation accuracy, latency, voice transfer, and naturalness across five X-to-English tasks. Moreover, we demonstrate that our model can be adapted to support a new input language with less than 1000 hours of speech. We provide examples, model weights, and inference code, and we release a benchmark containing 45 hours of multilingual data for speech translation evaluation.
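
To illustrate the GRPO step mentioned in the abstract, here is a minimal sketch of the group-relative advantage computation that GRPO is generally built on: several outputs are sampled for the same input, each is scored, and rewards are normalized within the group. The function name `group_relative_advantages`, the `eps` term, and the example reward values are hypothetical and chosen for illustration; this summary does not specify how Hibiki-Zero computes or applies its advantages.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same input.

    Assumption: the standard GRPO form (reward minus group mean, divided by
    group standard deviation). The exact variant used by Hibiki-Zero is not
    stated in this summary.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    # eps avoids division by zero when every sample gets the same reward
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four translations sampled from the same source audio
rewards = [0.2, 0.4, 0.6, 0.8]
advantages = group_relative_advantages(rewards)
```

Samples scoring above the group mean receive a positive advantage and those below a negative one, so no separate value model is needed to provide a baseline.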

Paper Structure (43 sections, 9 equations, 7 figures, 6 tables)

This paper contains 43 sections, 9 equations, 7 figures, 6 tables.

Introduction
Related Work
Simultaneous end-to-end speech translation
Self-improvement of real-time translation systems
Method
Modeling
Neural audio codec
Joint modeling of discrete audio tokens
Translation as multistream modeling
Architectural details
Coarse alignment of speech translation data
Sentence-level alignment
Natural pauses TTS
Translation policy reinforcement
Process rewards
...and 28 more sections

Figures (7)

Figure 1: Architecture of the RQ-Transformer. Figure adapted from moshi.
Figure 2: Joint sequence modeling. From the source stream, Hibiki-Zero predicts its Inner Monologue text stream, semantic and acoustic tokens. Figure adapted from hibiki.
Figure 3: Process rewards method based on BLEU score. We introduce intermediate BLEU score computed on the text output of the model before a given frame $t$ and using the ground-truth translation of the corresponding input sentences processed so far. We combine it with the total output BLEU score using $\alpha \in [0,1]$.
Figure 4: Influence of hyperparameter $\alpha$ during RL. We plot the BLEU score and text LAAL over training for various $\alpha$ (see Eq. \ref{eq:process_reward}), starting from the same supervised model using $n_w=8$.
Figure 5: Illustration of coarse translation alignment patterns. Waveform A is generated by a model trained on coarse alignments with random silences. Waveform B is generated by a model trained on coarse alignments with silences between sentences only.
...and 2 more figures
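
The Figure 3 caption describes combining an intermediate BLEU score with the total output BLEU score using $\alpha \in [0,1]$. The sketch below shows one way such a combination could look, assuming a simple convex mix in which $\alpha$ weights the intermediate score. That form, the function name `process_rewards`, and the example scores are assumptions for illustration; the actual definition is the paper's process-reward equation, which is not reproduced in this summary.

```python
def process_rewards(intermediate_bleu, final_bleu, alpha):
    """Mix per-frame intermediate BLEU scores with the final BLEU score.

    Assumption: reward_t = alpha * intermediate_bleu[t] + (1 - alpha) * final_bleu.
    Which term alpha weights, and any further scaling, is hypothetical here.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * b + (1.0 - alpha) * final_bleu for b in intermediate_bleu]


# Hypothetical BLEU scores measured at two checkpoints of one output
mixed = process_rewards([10.0, 20.0], final_bleu=30.0, alpha=0.5)
```

Under this assumed form, `alpha=0` rewards only the quality of the complete output, while larger values put more weight on how much correct translation has already been produced at each checkpoint, which is what pushes the policy toward lower latency.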
