RosettaSpeech: Zero-Shot Speech-to-Speech Translation from Monolingual Data
Zhisheng Zheng, Xiaohang Sun, Tuan Dinh, Abhishek Yanamandra, Abhinav Jain, Zhu Liu, Sunil Hadap, Vimal Bhat, Manoj Aggarwal, Gerard Medioni, David Harwath
TL;DR
RosettaSpeech addresses the scarcity of parallel S2ST data by enabling zero-shot speech-to-speech translation trained solely on monolingual speech-text data augmented with NMT supervision. It combines a Whisper-based speech encoder, a CosyVoice2 speech tokenizer, and a multilingual LLM backbone with multi-head projections that jointly generate text and discrete speech tokens, then uses a flow-based vocoder pipeline for synthesis. The approach outperforms prior zero-shot systems and some supervised baselines on CVSS-C FR/DE/ES→EN, and a single model also handles many-to-one translation, which illustrates its data efficiency. By relying on abundant text data instead of costly parallel speech data, the work offers a practical, scalable path to speaker-preserving S2ST across many languages, with clear directions for extension to broader language sets and bidirectional translation.
Abstract
The scarcity of parallel speech corpora critically hampers speech-to-speech translation (S2ST), often forcing reliance on complex, multi-stage pipelines. This paper introduces RosettaSpeech, a novel and simplified framework for zero-shot S2ST that is trained on monolingual speech-text data augmented by machine translation supervision. While our method leverages the linguistic knowledge inherent in text-based NMT models, it strictly eliminates the need for parallel speech-to-speech pairs. Our model uniquely uses text as an intermediate bridge during training but functions as a direct, end-to-end speech-to-speech model at inference. This streamlined approach achieves state-of-the-art results on standard benchmarks. For instance, on the CVSS-C test set, RosettaSpeech outperforms leading systems, achieving ASR-BLEU scores of 25.17 for German-to-English and 29.86 for Spanish-to-English, which are relative gains of over 27% and 14%, respectively. Furthermore, we demonstrate that a single model can deliver strong many-to-one translation performance (FR/ES/DE → EN). We also provide a foundational analysis of how training data scaling impacts model performance. By prioritizing reliance on abundant parallel text rather than difficult-to-acquire parallel speech, RosettaSpeech offers a scalable path to creating high-quality, speaker-preserving S2ST for a much broader array of languages.
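To make the relative-gain figures concrete, the following minimal sketch shows how a relative gain is computed and what baseline scores the stated gains would imply. The function names are illustrative, and the baseline bounds are derived here from the abstract's numbers by arithmetic; they are not scores reported in the abstract.

```python
def relative_gain(new_score: float, baseline: float) -> float:
    """Relative improvement of new_score over baseline, in percent."""
    return (new_score - baseline) / baseline * 100.0


def max_baseline_for_gain(new_score: float, min_gain_percent: float) -> float:
    """Largest baseline score for which new_score is at least
    min_gain_percent better in relative terms."""
    return new_score / (1.0 + min_gain_percent / 100.0)


# ASR-BLEU scores and minimum relative gains quoted in the abstract.
de_en_score, de_en_min_gain = 25.17, 27.0
es_en_score, es_en_min_gain = 29.86, 14.0

# Implied upper bounds on the compared systems' scores (derived, not reported).
de_en_bound = max_baseline_for_gain(de_en_score, de_en_min_gain)
es_en_bound = max_baseline_for_gain(es_en_score, es_en_min_gain)

print(f"DE->EN: compared system scored below about {de_en_bound:.2f} ASR-BLEU")
print(f"ES->EN: compared system scored below about {es_en_bound:.2f} ASR-BLEU")
```

Under these assumptions, a gain of over 27% at 25.17 ASR-BLEU implies a compared score below roughly 19.82, and a gain of over 14% at 29.86 implies one below roughly 26.19.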
