MSLM-S2ST: A Multitask Speech Language Model for Textless Speech-to-Speech Translation with Speaker Style Preservation
Yifan Peng, Ilia Kulikov, Yilin Yang, Sravya Popuri, Hui Lu, Changhan Wang, Hongyu Gong
TL;DR
The paper tackles textless speech-to-speech translation with speaker style preservation by introducing MSLM-S2ST, a decoder-only multitask speech language model that operates directly on speech representations. Using HuBERT-derived semantic units and EnCodec acoustic units, MSLM performs both semantic-to-semantic translation and semantic-to-acoustic generation within a single model, enabling bidirectional multilingual translation without text data. The approach demonstrates competitive translation quality and better speaker style preservation than cascaded baselines, while reducing parameter count and enabling cross-lingual transfer. These findings suggest a practical path toward efficient, textless S2ST applicable to unwritten languages, with implications for multilingual communication and downstream speech synthesis tasks.
Abstract
There has been growing research interest in, and progress on, speech-to-speech translation (S2ST), which translates utterances from one language to another. This work proposes the Multitask Speech Language Model (MSLM), a decoder-only speech language model trained in a multitask setting. Without relying on text training data, our model supports multilingual S2ST with speaker style preserved.
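
To make the multitask setup concrete, the following is a minimal sketch of how the two tasks named in the TL;DR (semantic-to-semantic translation and semantic-to-acoustic generation) could be serialized as token sequences for one decoder-only model. It is an illustration under stated assumptions, not the authors' implementation: the special tokens, the vocabulary sizes, the use of a single acoustic codebook stream, the acoustic style prompt, and all function names are hypothetical.

```python
from typing import List, Tuple

# Assumed vocabulary layout (hypothetical; not taken from the paper).
NUM_SEMANTIC = 1000   # assumed number of semantic unit clusters
NUM_ACOUSTIC = 1024   # assumed size of one acoustic codebook

SPECIAL = ["<bos>", "<eos>", "<sep>", "<task:translate>",
           "<task:synthesize>", "<lang:en>", "<lang:es>"]
SPECIAL_ID = {tok: i for i, tok in enumerate(SPECIAL)}

# Semantic and acoustic units share one vocabulary via fixed offsets.
SEM_OFFSET = len(SPECIAL)
AC_OFFSET = SEM_OFFSET + NUM_SEMANTIC
VOCAB_SIZE = AC_OFFSET + NUM_ACOUSTIC


def sem(units: List[int]) -> List[int]:
    """Map semantic unit indices into the shared vocabulary."""
    return [SEM_OFFSET + u for u in units]


def ac(codes: List[int]) -> List[int]:
    """Map acoustic code indices into the shared vocabulary."""
    return [AC_OFFSET + c for c in codes]


def build_translation_example(src_units: List[int], tgt_units: List[int],
                              tgt_lang: str) -> Tuple[List[int], List[int]]:
    """Task 1: source semantic units -> target semantic units."""
    prompt = ([SPECIAL_ID["<bos>"], SPECIAL_ID["<task:translate>"],
               SPECIAL_ID[tgt_lang]] + sem(src_units) + [SPECIAL_ID["<sep>"]])
    target = sem(tgt_units) + [SPECIAL_ID["<eos>"]]
    return prompt, target


def build_synthesis_example(tgt_units: List[int], style_codes: List[int],
                            tgt_codes: List[int]) -> Tuple[List[int], List[int]]:
    """Task 2: target semantic units plus an acoustic prompt assumed to
    carry the speaker style -> target acoustic units."""
    prompt = ([SPECIAL_ID["<bos>"], SPECIAL_ID["<task:synthesize>"]]
              + sem(tgt_units) + [SPECIAL_ID["<sep>"]]
              + ac(style_codes) + [SPECIAL_ID["<sep>"]])
    target = ac(tgt_codes) + [SPECIAL_ID["<eos>"]]
    return prompt, target


def loss_mask(prompt: List[int], target: List[int]) -> List[int]:
    """1 where the next-token loss applies (target), 0 on the prompt."""
    return [0] * len(prompt) + [1] * len(target)


if __name__ == "__main__":
    p, t = build_translation_example([3, 3, 7], [5, 1], "<lang:es>")
    print(p + t, loss_mask(p, t))
```

Under these assumptions, both tasks reduce to next-token prediction over one shared vocabulary, so a single set of decoder weights can be trained on a mixture of the two example types, with the task token selecting the behavior at inference time.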
