Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer
Yongqi Wang, Jionghao Bai, Rongjie Huang, Ruiqi Li, Zhiqing Hong, Zhou Zhao
TL;DR
The paper presents a direct speech-to-speech translation (S2ST) framework with cross-lingual style transfer that leverages discrete units: self-supervised representations for semantic content and codec units for detailed acoustic information. It decouples the pipeline into three stages: speech-to-semantic-unit translation (S2UT), acoustic unit modeling with an acoustic language model trained by in-context learning, and unit-to-wave generation with a GAN-based unit vocoder. This design enables style transfer without speaker-parallel data and zero-shot generalization to unseen source languages. Key contributions include a modular S2ST design compatible with existing S2UT models, an acoustic LM that captures voice style through in-context learning, and demonstrated improvements in speech quality and speaker similarity, outperforming cascaded baselines. The work enables practical, style-preserving multilingual translation and highlights both the potential and the risks of scalable style transfer in S2ST systems.
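The three-stage decoupling described above can be pictured as a composition of independent components. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the class name S2STPipeline, the field names, and the toy stand-in functions are all hypothetical, and it assumes that the style prompt is obtained by encoding the source speech into acoustic (codec) units.

```python
from dataclasses import dataclass
from typing import Callable, List

Units = List[int]    # a sequence of discrete unit ids
Wave = List[float]   # a mono waveform as raw samples


@dataclass
class S2STPipeline:
    """Hypothetical container for the three stages (names are illustrative)."""
    s2ut: Callable[[Wave], Units]                 # stage 1: source speech -> target semantic units
    codec_encode: Callable[[Wave], Units]         # assumed: source speech -> acoustic units (style prompt)
    acoustic_lm: Callable[[Units, Units], Units]  # stage 2: (semantic units, style prompt) -> acoustic units
    vocoder: Callable[[Units], Wave]              # stage 3: acoustic units -> waveform

    def translate(self, source_wave: Wave) -> Wave:
        semantic = self.s2ut(source_wave)               # translated content
        prompt = self.codec_encode(source_wave)         # style of the source speaker
        acoustic = self.acoustic_lm(semantic, prompt)   # content rendered in the prompt's style
        return self.vocoder(acoustic)


# Toy stand-ins so the sketch runs end to end; real stages would be trained models.
def toy_s2ut(wave: Wave) -> Units:
    return [len(wave) % 7, 1, 2]

def toy_codec_encode(wave: Wave) -> Units:
    return [9, 9]

def toy_acoustic_lm(semantic: Units, prompt: Units) -> Units:
    return prompt + semantic

def toy_vocoder(units: Units) -> Wave:
    return [float(u) for u in units]


pipeline = S2STPipeline(toy_s2ut, toy_codec_encode, toy_acoustic_lm, toy_vocoder)
output_wave = pipeline.translate([0.0, 0.0, 0.0, 0.0])
```

Because each stage is just a callable, the first stage could in principle be swapped for any existing S2UT model without touching the other two, which is the compatibility property the summary points to.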
Abstract
Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy, but it cannot preserve the speaker timbre of the source speech. Meanwhile, the scarcity of high-quality speaker-parallel data poses a challenge for learning style transfer during translation. We design an S2ST pipeline with style-transfer capability based on discrete self-supervised speech representations and codec units. The acoustic language model we introduce for style transfer leverages self-supervised in-context learning, acquiring style-transfer ability without relying on any speaker-parallel data and thereby overcoming data scarcity. Trained on extensive data, our model achieves zero-shot cross-lingual style transfer on previously unseen source languages. Experiments show that our model generates translated speech with high fidelity and speaker similarity. Audio samples are available at http://stylelm.github.io/ .
