High-Fidelity Simultaneous Speech-To-Speech Translation

Tom Labiausse, Laurent Mazaré, Edouard Grave, Patrick Pérez, Alexandre Défossez, Neil Zeghidour

TL;DR

Hibiki tackles simultaneous speech-to-speech translation with a decoder-only, multistream architecture that jointly models the source and target audio streams and emits text and audio tokens in real time. It combines a neural audio codec (Mimi), a joint token model (RQ-Transformer), and a text stream (Inner Monologue) to enable causal, low-latency translation, and it is trained on synthetic data built with contextual alignment. The paper introduces alignment-based data synthesis (contextual alignment, silence insertion, alignment-aware TTS) and voice-transfer conditioning with classifier-free guidance to improve speaker fidelity. On French-English, Hibiki reaches state-of-the-art translation quality, speaker fidelity, and naturalness, while supporting batched GPU inference and real-time on-device deployment. Models and inference code are publicly released.

Abstract

We introduce Hibiki, a decoder-only model for simultaneous speech translation. Hibiki leverages a multistream language model to synchronously process source and target speech, and jointly produces text and audio tokens to perform speech-to-text and speech-to-speech translation. We furthermore address the fundamental challenge of simultaneous interpretation: unlike its consecutive counterpart, where one waits for the end of the source utterance to start translating, it adapts its flow to accumulate just enough context to produce a correct translation in real time, chunk by chunk. To do so, we introduce a weakly-supervised method that leverages the perplexity of an off-the-shelf text translation system to identify optimal delays on a per-word basis and create aligned synthetic data. After supervised training, Hibiki performs adaptive, simultaneous speech translation with vanilla temperature sampling. On a French-English simultaneous speech translation task, Hibiki demonstrates state-of-the-art performance in translation quality, speaker fidelity and naturalness. Moreover, the simplicity of its inference process makes it compatible with batched translation and even real-time on-device deployment. We provide examples as well as models and inference code.
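
The per-word delay idea in the abstract can be illustrated with a small sketch. It assumes that some text translation model has already scored each target word against every truncation of the source sentence, and it picks, for each target word, the source word whose arrival produces the largest jump in log-likelihood. The function name `contextual_alignment`, the table layout, and the toy scores are all hypothetical and chosen for illustration; they are not the authors' code or data, and the paper's exact alignment rule may differ in its details.

```python
from typing import List, Sequence


def contextual_alignment(loglik: Sequence[Sequence[float]]) -> List[int]:
    """Return, for each target word, the 1-based index of the source word
    whose arrival yields the largest log-likelihood gain.

    Assumed layout (hypothetical): loglik[j][i] is the log-likelihood of
    target word j given the previous target words and the first i source
    words, for i = 0 .. n (so each row has n + 1 entries).
    """
    alignment = []
    for row in loglik:
        # gains[k] is the gain from adding source word k + 1 to the context.
        gains = [row[i] - row[i - 1] for i in range(1, len(row))]
        best = max(range(len(gains)), key=gains.__getitem__)
        alignment.append(best + 1)
    return alignment


# Toy scores, invented for illustration only (not measured values).
# Source: "Je vais en ville"; the rows score the target words "into" and "town".
TOY_LOGLIK = [
    [-9.0, -8.5, -8.0, -1.5, -1.2],   # "into": big jump once "en" (word 3) is seen
    [-10.0, -9.8, -9.5, -9.0, -0.7],  # "town": big jump once "ville" (word 4) is seen
]

if __name__ == "__main__":
    print(contextual_alignment(TOY_LOGLIK))
```

In this toy setting the target word "into" cannot be emitted before the third source word has been heard, which is the kind of per-word delay that silence insertion or an alignment-aware TTS would then have to respect in the synthetic target audio.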

Paper Structure

This paper contains 35 sections, 8 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Generating aligned interpretations. We extract unsupervised word-level contextual alignment, which we lift to audio by either inserting silences or re-synthesising with an alignment-aware TTS. See the section on alignment for details.
  • Figure 2: Joint sequence modeling with contextual alignment. From the source stream, Hibiki predicts its Inner Monologue text stream and audio tokens. Its output is aligned for causality, as depicted in the contextual-delays waveform figure. Figure adapted from Moshi.
  • Figure 3: Contextual alignment. We compute the log-likelihood of the word "into" with a pre-trained text translation model, for various input truncations. Once the matching source word "en" appears, we observe a large increase in log-likelihood (see the contextual alignment equation).
  • Figure 4: Speaker similarity between source and target speech in CVSS-T training data, before and after resynthesis.
  • Figure 5: Batched inference speed of Hibiki on an H100 SXM.
  • ...and 2 more figures