SeamlessExpressiveLM: Speech Language Model for Expressive Speech-to-Speech Translation with Chain-of-Thought
Hongyu Gong, Bandhav Veluri
TL;DR
SeamlessExpressiveLM tackles expressive speech-to-speech translation (S2ST) with a single decoder-only LM that unifies semantic and multi-stream acoustic generation via chain-of-thought prompting. It operates on HuBERT semantic units and EnCodec acoustic tokens, and uses randomly cropped acoustic prompts to inject style, so it learns S2ST end to end without style-aligned data or cascaded networks. On Spanish-to-English and Hungarian-to-English translation, the model achieves higher vocal style similarity than cascaded LMs, with competitive semantic accuracy and better parameter efficiency. Using one model instead of a cascade also reduces computational overhead and mitigates the error propagation typical of cascaded systems.
Abstract
Expressive speech-to-speech translation (S2ST) is a key research topic in seamless communication, which focuses on preserving both the semantics and the speaker's vocal style in translated speech. Early work synthesized speaker-style-aligned speech in order to directly learn the mapping from source speech to the target speech spectrogram. Without relying on style-aligned data, recent studies leverage advances in language modeling (LM) and build cascaded LMs on semantic and acoustic tokens. This work proposes SeamlessExpressiveLM, a single speech language model for expressive S2ST. We decompose the complex source-to-target speech mapping into intermediate generation steps with chain-of-thought prompting. The model is first guided to translate the target semantic content and then to transfer the speaker style to multi-stream acoustic units. Evaluated on Spanish-to-English and Hungarian-to-English translation, SeamlessExpressiveLM outperforms cascaded LMs in both semantic quality and style transfer, while achieving better parameter efficiency.
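
The two-step decomposition described above can be pictured as one ordered sequence that a decoder-only model reads left to right. The following is a minimal sketch under stated assumptions, not the authors' implementation: the segment names, the helper functions `crop_prompt` and `build_cot_sequence`, the toy token ids, and the choice to crop the acoustic prompt from the target utterance at training time are all illustrative and are not specified in this text.

```python
import random
from typing import List, Sequence, Tuple

# One acoustic frame: one token id per codebook stream (multi-stream units).
Frame = Tuple[int, ...]


def crop_prompt(frames: Sequence[Frame], prompt_len: int,
                rng: random.Random) -> List[Frame]:
    """Return a random contiguous window of at most prompt_len frames."""
    if len(frames) <= prompt_len:
        return list(frames)
    start = rng.randrange(len(frames) - prompt_len + 1)
    return list(frames[start:start + prompt_len])


def build_cot_sequence(src_sem: Sequence[int], tgt_sem: Sequence[int],
                       tgt_acoustic: Sequence[Frame], prompt_len: int,
                       rng: random.Random) -> List[Tuple[str, list]]:
    """Lay out one training example in chain-of-thought order.

    Assumption: the acoustic prompt is cropped from the target utterance,
    so no style-aligned source/target pair is needed.
    """
    prompt = crop_prompt(tgt_acoustic, prompt_len, rng)
    return [
        ("source_semantic", list(src_sem)),       # what was said
        ("target_semantic", list(tgt_sem)),       # step 1: translate content
        ("acoustic_prompt", prompt),              # style reference
        ("target_acoustic", list(tgt_acoustic)),  # step 2: render with style
    ]


# Toy example with made-up token ids and two acoustic streams per frame.
rng = random.Random(0)
segments = build_cot_sequence(
    src_sem=[11, 52, 52, 7],
    tgt_sem=[3, 90, 41],
    tgt_acoustic=[(i, i + 100) for i in range(6)],
    prompt_len=2,
    rng=rng,
)
for name, tokens in segments:
    print(name, tokens)
```

Placing the target semantic units before the acoustic units means the model commits to the translated content first and only then conditions on the prompt to produce style-bearing acoustic tokens, which mirrors the "translate, then transfer style" ordering in the abstract.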
