SLM-S2ST: A multimodal language model for direct speech-to-speech translation

Yuxuan Hu, Haibin Wu, Ruchao Fan, Xiaofei Wang, Heng Lu, Yao Qian, Jinyu Li

TL;DR

SLM-S2ST performs direct speech-to-speech translation by adding a delayed audio head and a streaming vocoder to a pretrained multimodal LM (Phi4-MM), so the model emits text and speech together. Trained on the CVSS-C dataset (≈905 hours), it outperforms baselines trained on the same data in S2ST BLEU and shows strong S2TT performance; when augmented with ~11k hours of in-house data and scaled to 7B parameters, it remains competitive with the current SOTA model. The approach freezes the base LLM, trains an audio post-LM via LoRA, and synthesizes waveforms in real time with a streaming flow-based decoder followed by HiFi-GAN. Because it builds on open-source components, the system is easier to reproduce, and the results suggest that the same design, with its improved speech-text alignment, could extend to other speech-to-speech tasks.

Abstract

Speech-aware language models (LMs) have demonstrated capabilities in understanding spoken language while generating text-based responses. However, enabling them to produce speech output efficiently and effectively remains a challenge. In this paper, we present SLM-S2ST, a multimodal LM for direct speech-to-speech translation (S2ST), built on the open-source Phi4-MM model. SLM-S2ST extends its predecessor by generating translated speech using an audio transformer head that predicts audio tokens with a delay relative to text tokens, followed by a streaming vocoder for waveform synthesis. Our experimental results on the CVSS-C dataset demonstrate SLM-S2ST's superior performance, significantly surpassing existing baseline models trained on the same dataset. Furthermore, when we scale up the training data and the model size, SLM-S2ST reaches on-par performance with the current SOTA model.

Paper Structure

This paper contains 16 sections, 1 figure, 5 tables.

Figures (1)

  • Figure 1: The proposed framework for end-to-end speech-to-speech translation. The Phi4-MM model includes Phi4-MM Shared Layer and Text Post-LM. The Audio Post-LM is initialized with the Text Post-LM, and is trainable.