MSLM-S2ST: A Multitask Speech Language Model for Textless Speech-to-Speech Translation with Speaker Style Preservation

Yifan Peng; Ilia Kulikov; Yilin Yang; Sravya Popuri; Hui Lu; Changhan Wang; Hongyu Gong

TL;DR

The paper tackles textless speech-to-speech translation with speaker style preservation by introducing MSLM-S2ST, a decoder-only multitask speech language model that operates directly on speech representations. Using HuBERT-derived semantic units and EnCodec acoustic units, MSLM performs both semantic-to-semantic translation and semantic-to-acoustic generation within a single model, enabling bidirectional multilingual translation without text data. The approach demonstrates competitive translation quality and better speaker style preservation than cascaded baselines, while reducing parameter count and enabling cross-lingual transfer. These findings suggest a practical path toward efficient, textless S2ST applicable to unwritten languages, with implications for multilingual communication and downstream speech synthesis tasks.

Abstract

There has been growing research interest and progress in speech-to-speech translation (S2ST), which translates utterances from one language into another. This work proposes the Multitask Speech Language Model (MSLM), a decoder-only speech language model trained in a multitask setting. Without relying on text training data, our model supports multilingual S2ST while preserving speaker style.

Paper Structure (21 sections, 3 equations, 2 figures, 3 tables)

Introduction
Related Work
Proposed Method
Speech unit extraction
Semantic-to-semantic translation
Semantic-to-acoustic generation
Multitask training
Experiments
Experimental setup
Results
Conclusion
Limitations and Risks
Related Work
Proposed Method
Experiments
...and 6 more sections

Figures (2)

Figure 1: Inference procedure of speaker style-preserved S2ST. Previous studies (e.g., PolyVoice) typically use separate LMs for each step and each translation direction, whereas our MSLM performs (a) and (b) in a single autoregressive (AR) LM controlled by a special task token. MSLM also supports multilingual translation controlled by special language tokens.
Figure 2: Overall pipeline of speaker style-preserved S2ST. The source speech is first translated to target semantic units and then converted to target acoustic units. Finally, the target speech is synthesized using a pre-trained EnCodec decoder.
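
The two-step inference described in the captions above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation. It assumes that steps (a) and (b) correspond to semantic-to-semantic translation and semantic-to-acoustic generation, and that the source acoustic units are passed as a style prompt in step (b). The token spellings (`<task:...>`, `<lang:...>`), the prompt layout, the function names, and the `lm` callable are all hypothetical; the final EnCodec decoding step is left out.

```python
from typing import Callable, List, Tuple

# Hypothetical special tokens; the paper's actual vocabulary is not shown here.
TASK_S2S = "<task:sem2sem>"  # step (a): semantic-to-semantic translation
TASK_S2A = "<task:sem2ac>"   # step (b): semantic-to-acoustic generation

# Stand-in for one autoregressive LM: takes a prompt, returns generated units.
LM = Callable[[List[str]], List[str]]


def lang_token(lang: str) -> str:
    """Hypothetical language-control token."""
    return f"<lang:{lang}>"


def build_translation_prompt(src_lang: str, tgt_lang: str,
                             src_semantic: List[str]) -> List[str]:
    """Prompt for step (a): source semantic units, then the target language."""
    return [TASK_S2S, lang_token(src_lang), *src_semantic, lang_token(tgt_lang)]


def build_acoustic_prompt(tgt_lang: str, tgt_semantic: List[str],
                          src_acoustic: List[str]) -> List[str]:
    """Prompt for step (b). Assumption: source acoustic units carry the style."""
    return [TASK_S2A, lang_token(tgt_lang), *tgt_semantic, *src_acoustic]


def run_s2st(lm: LM, src_lang: str, tgt_lang: str,
             src_semantic: List[str],
             src_acoustic: List[str]) -> Tuple[List[str], List[str]]:
    """Call the same LM twice, switching tasks via the task token."""
    tgt_semantic = lm(build_translation_prompt(src_lang, tgt_lang, src_semantic))
    tgt_acoustic = lm(build_acoustic_prompt(tgt_lang, tgt_semantic, src_acoustic))
    # A pre-trained EnCodec decoder would turn tgt_acoustic into a waveform.
    return tgt_semantic, tgt_acoustic
```

The point of the sketch is that a single `lm` serves both steps and every language pair: only the leading task token and the language tokens change between calls, in contrast to pipelines that keep a separate LM per step and per translation direction.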
