Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer

Yongqi Wang; Jionghao Bai; Rongjie Huang; Ruiqi Li; Zhiqing Hong; Zhou Zhao

TL;DR

The paper presents a direct speech-to-speech translation (S2ST) framework with cross-lingual style transfer that uses discrete self-supervised representations for semantic content and codec units for detailed acoustic information. It decouples the pipeline into three stages: speech-to-semantic-unit translation (S2UT), acoustic unit modeling with an acoustic language model trained by in-context learning, and unit-to-wave generation with a GAN-based unit vocoder. This design enables style transfer without speaker-parallel data and zero-shot generalization to unseen source languages. Key contributions are a modular S2ST design compatible with existing S2UT models, an acoustic language model that captures voice style through in-context learning, and improved speech quality and speaker similarity, outperforming cascaded baselines. The work makes style-preserving multilingual translation more practical and notes both the potential and the risks of scalable style transfer in S2ST systems.
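The three-stage decoupling can be pictured as a composition of interchangeable components. The following is a minimal sketch, not the authors' implementation: the class name `StyleS2STPipeline`, its field names, and the idea of obtaining the style prompt by codec-encoding the source speech are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List

Units = List[int]    # a sequence of discrete unit ids
Wave = List[float]   # a mono waveform as raw samples


@dataclass
class StyleS2STPipeline:
    """Hypothetical wiring of the three stages; every callable is a stand-in."""

    # Stage 1: source speech -> semantic units in the target language.
    s2ut: Callable[[Wave], Units]
    # Assumed helper: encodes the source speech into acoustic (codec) units
    # that serve as the style prompt.
    codec_encode: Callable[[Wave], Units]
    # Stage 2: (target semantic units, style prompt) -> target acoustic units.
    acoustic_lm: Callable[[Units, Units], Units]
    # Stage 3: acoustic units -> waveform (a GAN-based unit vocoder in the paper).
    vocoder: Callable[[Units], Wave]

    def translate(self, source_wave: Wave) -> Wave:
        semantic_units = self.s2ut(source_wave)
        style_prompt = self.codec_encode(source_wave)
        acoustic_units = self.acoustic_lm(semantic_units, style_prompt)
        return self.vocoder(acoustic_units)
```

Because each stage is only a callable here, an existing S2UT model could in principle be dropped into the first slot without touching the other two, which is the modularity the summary describes.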

Abstract

Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy, but it cannot preserve the speaker timbre of the source speech. Meanwhile, the scarcity of high-quality speaker-parallel data makes it difficult to learn style transfer during translation. We design an S2ST pipeline with style-transfer capability based on discrete self-supervised speech representations and codec units. The acoustic language model we introduce for style transfer leverages self-supervised in-context learning, acquiring style-transfer ability without relying on any speaker-parallel data and thereby overcoming data scarcity. By using extensive training data, our model achieves zero-shot cross-lingual style transfer on previously unseen source languages. Experiments show that our model generates translated speech with high fidelity and speaker similarity. Audio samples are available at http://stylelm.github.io/.
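To illustrate why in-context learning can avoid speaker-parallel data, here is a minimal sketch of one way a training example could be built from a single monolingual utterance. The abstract does not give this procedure. The split-one-utterance scheme, the function `make_icl_example`, and the `prompt_fraction` parameter are assumptions made for illustration only.

```python
from typing import List, NamedTuple


class ICLExample(NamedTuple):
    semantic_target: List[int]   # content the model should render
    acoustic_prompt: List[int]   # style context taken from the same speaker
    acoustic_target: List[int]   # acoustic units the model learns to predict


def make_icl_example(
    semantic: List[int],
    acoustic: List[int],
    prompt_fraction: float = 0.3,
) -> ICLExample:
    """Split one utterance into a style prompt and a prediction target.

    Assumption: both unit sequences cover the same audio, so cutting each at
    the same relative position yields roughly aligned segments even when
    their frame rates differ.
    """
    if not 0.0 < prompt_fraction < 1.0:
        raise ValueError("prompt_fraction must lie strictly between 0 and 1")
    cut_semantic = int(len(semantic) * prompt_fraction)
    cut_acoustic = int(len(acoustic) * prompt_fraction)
    return ICLExample(
        semantic_target=semantic[cut_semantic:],
        acoustic_prompt=acoustic[:cut_acoustic],
        acoustic_target=acoustic[cut_acoustic:],
    )
```

Under this assumed scheme the prompt and the target always share a speaker, so the model can learn to carry style from the prompt into the output without ever seeing the same speaker in two languages.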

Paper Structure (17 sections, 1 equation, 5 figures, 3 tables)

This paper contains 17 sections, 1 equation, 5 figures, 3 tables.

Introduction
Method
Semantic and Acoustic Units
Speech-to-Semantic-Unit Translation
Acoustic Unit Modeling
Unit-to-Wave Generation
Experiments
Setup
Results and Analysis
Ablation Studies
Conclusions
Limitations and Potential Risks
Datasets
Model Settings
$S_2$ Model Architecture
...and 2 more sections

Figures (5)

Figure 1: We propose an S2ST approach with style transfer based on discrete representations from a self-supervised speech model and a neural codec. Figure (a) shows the inference pipeline of our method; figure (b) illustrates the self-supervised training process of the acoustic language model of $S_2$.
Figure 2: The multi-scale architecture of UniAudio used for the $S_2$ stage model.
Figure 3: Structure of the global transformer.
Figure 4: Screenshot of MOS testing.
Figure 5: Screenshot of SMOS testing.
