MSLM-S2ST: A Multitask Speech Language Model for Textless Speech-to-Speech Translation with Speaker Style Preservation

Yifan Peng, Ilia Kulikov, Yilin Yang, Sravya Popuri, Hui Lu, Changhan Wang, Hongyu Gong

TL;DR

The paper tackles textless speech-to-speech translation with speaker style preservation by introducing MSLM-S2ST, a decoder-only multitask speech language model that operates directly on speech representations. Using HuBERT-derived semantic units and EnCodec acoustic units, MSLM performs both semantic-to-semantic translation and semantic-to-acoustic generation within a single model, enabling bidirectional multilingual translation without text data. The approach demonstrates competitive translation quality and superior speaker style preservation compared to cascaded baselines, while reducing parameter count and enabling cross-lingual transfer. These findings suggest a practical path toward efficient, textless S2ST applicable to unwritten languages, with implications for multilingual communication and downstream speech synthesis tasks.

Abstract

There has been growing research interest and progress in speech-to-speech translation (S2ST), the task of translating utterances from one language to another. This work proposes the Multitask Speech Language Model (MSLM), a decoder-only speech language model trained in a multitask setting. Without relying on text training data, our model supports multilingual S2ST while preserving speaker style.

Paper Structure (21 sections, 3 equations, 2 figures, 3 tables)

This paper contains 21 sections, 3 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Inference procedure of speaker style-preserved S2ST. Previous studies [polyvoice, s2st-style] typically use separate LMs for each step and each translation direction, whereas our MSLM performs (a) and (b) in a single autoregressive (AR) LM controlled by a special task token. MSLM also supports multilingual translation controlled by special language tokens.
  • Figure 2: Overall pipeline of speaker style-preserved S2ST. The source speech is first translated to target semantic units and then converted to target acoustic units. Finally, the target speech is synthesized using a pre-trained EnCodec decoder.
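
The two-step inference described in the figure captions can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the token names (`<sem2sem>`, `<sem2ac>`, `<es>`, `<en>`), the prompt layouts, and the helper functions are all hypothetical, and the language model is abstracted as any callable that maps a prompt token list to generated tokens. In particular, conditioning step (b) on acoustic units of the source speech is an assumption about how speaker style might be carried over.

```python
from typing import Callable, List

# Hypothetical special tokens; the paper's actual vocabulary is not given here.
TASK_S2S = "<sem2sem>"  # step (a): source semantic units -> target semantic units
TASK_S2A = "<sem2ac>"   # step (b): target semantic units -> target acoustic units

# Stand-in for a single autoregressive LM: prompt tokens in, generated tokens out.
LM = Callable[[List[str]], List[str]]


def lang_token(lang: str) -> str:
    """Build a hypothetical language-control token such as '<en>'."""
    return f"<{lang}>"


def translation_prompt(src_sem: List[str], src_lang: str, tgt_lang: str) -> List[str]:
    """Prompt for step (a). The layout is an assumption for illustration."""
    return [TASK_S2S, lang_token(src_lang), *src_sem, lang_token(tgt_lang)]


def acoustic_prompt(tgt_sem: List[str], style_acoustic: List[str]) -> List[str]:
    """Prompt for step (b).

    Assumption: speaker style is conveyed by placing acoustic units of the
    source speech in the prompt before the target semantic units.
    """
    return [TASK_S2A, *style_acoustic, *tgt_sem]


def speech_to_speech(
    lm: LM,
    src_sem: List[str],
    src_acoustic: List[str],
    src_lang: str,
    tgt_lang: str,
) -> List[str]:
    """Run both steps with the same LM, switching behavior via the task token."""
    tgt_sem = lm(translation_prompt(src_sem, src_lang, tgt_lang))
    tgt_acoustic = lm(acoustic_prompt(tgt_sem, src_acoustic))
    # A pre-trained codec decoder would turn these units into a waveform.
    return tgt_acoustic
```

The point of the sketch is that one model serves both steps and all translation directions: only the leading task token and the language tokens change between calls, so no separate LM per step or per direction is needed.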