PolyVoice: Language Models for Speech to Speech Translation

Qianqian Dong, Zhiying Huang, Qiao Tian, Chen Xu, Tom Ko, Yunlong Zhao, Siyuan Feng, Tang Li, Kexin Wang, Xuxin Cheng, Fengpeng Yue, Ye Bai, Xi Chen, Lu Lu, Zejun Ma, Yuping Wang, Mingxuan Wang, Yuxuan Wang

TL;DR

PolyVoice is a two-LM pipeline for speech-to-speech translation (S2ST) that operates on discretized semantic and acoustic units. A decoder-only S2UT (speech-to-unit translation) front-end converts source-language speech units into target-language units, and a U2S (unit-to-speech) back-end with a SoundStream-based codec synthesizes the target speech while preserving the source speaker's voice and style. Because the units are learned without text transcripts, the approach extends to unwritten languages, and prompts are used to combine diverse data sources during training. On Chinese-English and English-Spanish tasks, the system shows competitive translation quality together with superior speech naturalness and voice cloning; ablations highlight the importance of the duration model and of the decoder-only architecture. The authors identify scaling and improved semantic-unit extraction as the main directions for further work.

Abstract

We propose PolyVoice, a language model-based framework for speech-to-speech translation (S2ST) systems. Our framework consists of two language models: a translation language model and a speech synthesis language model. We use discretized speech units, which are generated in a fully unsupervised way, and thus our framework can be used for unwritten languages. For the speech synthesis part, we adopt the existing VALL-E X approach and build a unit-based audio language model. This grants our framework the ability to preserve the voice characteristics and the speaking style of the original speech. We examine our system on Chinese $\rightarrow$ English and English $\rightarrow$ Spanish pairs. Experimental results show that our system can generate speech with high translation quality and audio quality. Speech samples are available at https://speechtranslation.github.io/polyvoice.
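To make the data flow concrete, the two-stage design can be sketched as a composition of three callables: a unit extractor, a translation LM over units, and a synthesis LM that takes the source speech as a voice prompt. This is only an illustrative sketch, not the authors' implementation; the class name `TwoStageS2ST`, its field names, and the toy stand-in functions are all hypothetical and merely mimic the interfaces described above.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Units = List[int]  # discrete unit IDs


@dataclass
class TwoStageS2ST:
    """Hypothetical wrapper mirroring the two-LM data flow (not the paper's code)."""
    extract_units: Callable[[Sequence[float]], Units]            # unsupervised discretizer
    translate_units: Callable[[Units], Units]                    # translation LM (S2UT)
    synthesize: Callable[[Units, Sequence[float]], List[float]]  # synthesis LM (U2S)

    def __call__(self, source_wave: Sequence[float]) -> List[float]:
        src_units = self.extract_units(source_wave)
        tgt_units = self.translate_units(src_units)
        # The source speech doubles as the prompt that carries voice and style.
        return self.synthesize(tgt_units, source_wave)


# Toy stand-ins (assumptions for illustration only; real components are neural models).
def toy_extract(wave: Sequence[float]) -> Units:
    # Bucket each sample in [-1, 1] into one of 4 unit IDs.
    return [min(3, int((x + 1.0) * 2)) for x in wave]


def toy_translate(units: Units) -> Units:
    # Pretend "translation" is a reordering of the unit sequence.
    return units[::-1]


def toy_synthesize(units: Units, prompt: Sequence[float]) -> List[float]:
    # Scale the output by the prompt's peak amplitude to mimic voice conditioning.
    gain = max(abs(x) for x in prompt)
    return [gain * (u / 3.0) for u in units]


pipeline = TwoStageS2ST(toy_extract, toy_translate, toy_synthesize)
output_wave = pipeline([-1.0, -0.5, 0.0, 0.5])
```

The point of the sketch is the interface: no text transcript appears anywhere in the chain, and the synthesis stage receives both the translated units and the original audio.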

Paper Structure (33 sections, 1 figure, 7 tables)

This paper contains 33 sections, 1 figure, 7 tables.

Figures (1)

  • Figure 1: Overview of PolyVoice. The framework consists of two LM-based components: an S2UT front-end for translation and a U2S back-end for synthesis.