Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer

Yongqi Wang, Jionghao Bai, Rongjie Huang, Ruiqi Li, Zhiqing Hong, Zhou Zhao

TL;DR

The paper presents a direct speech-to-speech translation (S2ST) framework with cross-lingual style transfer. It uses discrete self-supervised speech units for semantic content and neural codec units for detailed acoustic information. The pipeline is decoupled into three stages: (1) speech-to-semantic-unit translation, (2) acoustic unit modeling with an acoustic language model trained by in-context learning, and (3) unit-to-wave generation with a GAN-based unit vocoder. This design enables style transfer without speaker-parallel data and zero-shot generalization to unseen source languages. Key contributions are a modular S2ST design compatible with existing speech-to-unit translation (S2UT) models, an acoustic LM that captures voice style through in-context learning, and reported improvements in speech quality and speaker similarity over cascaded baselines. The work points toward practical, style-preserving multilingual translation and notes both the potential and the risks of scalable style transfer in S2ST systems.
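As a rough illustration of the three-stage decomposition summarized above, the following Python sketch wires the stages together as plain callables. It is a minimal sketch under stated assumptions, not the authors' implementation: the class name, field names, and type aliases are hypothetical, and using codec units of the source speech as the style prompt is an assumption drawn from the summary rather than a confirmed interface.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type aliases for illustration only.
Waveform = List[float]
SemanticUnits = List[int]   # discrete units from a self-supervised speech model
AcousticUnits = List[int]   # discrete neural-codec units carrying style detail


@dataclass
class StyleS2STPipeline:
    """Hypothetical container for the three stages; not the paper's API."""
    s2ut: Callable[[Waveform], SemanticUnits]             # stage 1
    codec_encode: Callable[[Waveform], AcousticUnits]     # style prompt extractor
    acoustic_lm: Callable[[SemanticUnits, AcousticUnits], AcousticUnits]  # stage 2
    vocoder: Callable[[AcousticUnits], Waveform]          # stage 3

    def translate(self, source_wave: Waveform) -> Waveform:
        # Stage 1: translate source speech into target-language semantic units.
        target_semantic = self.s2ut(source_wave)
        # Assumption: the source utterance itself supplies the style prompt.
        style_prompt = self.codec_encode(source_wave)
        # Stage 2: predict target acoustic units from content plus style prompt.
        target_acoustic = self.acoustic_lm(target_semantic, style_prompt)
        # Stage 3: synthesize the waveform from acoustic units.
        return self.vocoder(target_acoustic)
```

Because each stage is only a callable, an existing S2UT model could in principle be dropped into the `s2ut` slot without touching the other two stages, which is the kind of modularity the summary describes.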

Abstract

Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy, but is unable to preserve the speaker timbre of the source speech. Meanwhile, the scarcity of high-quality speaker-parallel data poses a challenge for learning style transfer during translation. We design an S2ST pipeline with style-transfer capability on the basis of discrete self-supervised speech representations and codec units. The acoustic language model we introduce for style transfer leverages self-supervised in-context learning, acquiring style transfer ability without relying on any speaker-parallel data, thereby overcoming data scarcity. By using extensive training data, our model achieves zero-shot cross-lingual style transfer on previously unseen source languages. Experiments show that our model generates translated speech with high fidelity and speaker similarity. Audio samples are available at http://stylelm.github.io/.
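To make the idea of self-supervised in-context learning without speaker-parallel data more concrete, here is a minimal Python sketch of how a training example might be built from a single utterance. Everything here is an assumption for illustration: the function name, the prefix/remainder split, the one-to-one alignment between semantic and acoustic tokens, and the single-stream codec representation are simplifications, not details confirmed by the abstract.

```python
from typing import List, Tuple


def make_in_context_example(
    semantic: List[int],
    acoustic: List[int],
    prompt_len: int,
) -> Tuple[List[int], List[int], List[int]]:
    """Split one utterance into (content, style_prompt, target).

    Hypothetical helper. Assumes one acoustic token per semantic token,
    which is a simplification for illustration.
    """
    if len(semantic) != len(acoustic):
        raise ValueError("semantic and acoustic sequences must be aligned")
    if not 0 < prompt_len < len(acoustic):
        raise ValueError("prompt_len must leave a non-empty prompt and target")

    # The prefix of the utterance's own acoustic units acts as the style prompt.
    style_prompt = acoustic[:prompt_len]
    # The model would be trained to predict the remaining acoustic units...
    target = acoustic[prompt_len:]
    # ...conditioned on the semantic units of that same remaining span.
    content = semantic[prompt_len:]
    return content, style_prompt, target


content, prompt, target = make_in_context_example(
    semantic=[10, 11, 12, 13], acoustic=[1, 2, 3, 4], prompt_len=2
)
```

Since the prompt and the target come from the same recording, they share a speaker by construction, so no pairs of the same speaker in two languages are required to learn the style-following behavior.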

Paper Structure (17 sections, 1 equation, 5 figures, 3 tables)

Figures (5)

  • Figure 1: We propose an S2ST approach with style transfer based on discrete representations from a self-supervised speech model and a neural codec. Panel (a) shows the inference pipeline of our method; panel (b) illustrates the self-supervised training process of the acoustic language model in stage $S_2$.
  • Figure 2: The multi-scale architecture of UniAudio used for the $S_2$ stage model.
  • Figure 3: Structure of the global transformer.
  • Figure 4: Screenshot of MOS testing.
  • Figure 5: Screenshot of SMOS testing.