BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs

Yue Wang, Ruotian Ma, Xingyu Chen, Zhengliang Shi, Wanshun Chen, Huang Liu, Jiadi Yao, Qu Yang, Qingxuan Jiang, Fanghua Ye, Juntao Li, Min Zhang, Zhaopeng Tu, Xiaolong Li, Linus

TL;DR

BatonVoice introduces an operationalist, conductor–orchestra paradigm for controllable speech synthesis. It decouples LLM instruction understanding from TTS generation: the LLM translates user prompts into textual vocal features, and BatonTTS renders those features as speech. The approach delivers strong emotion control with high intelligibility and generalizes zero-shot to languages unseen during post-training. The authors present a three-stage training pipeline (Pre-Train, SFT, and APO-down) that leverages automated vocal plans and a frozen CosyVoice2 decoder, outperforming open- and closed-source baselines without manual instruction data. Empirical results show that performance improves with larger LLMs, and ablations examine the framework's key components. Together these highlight the framework's practical value for expressive TTS and its potential applicability to other modalities through textual feature representations.

Abstract

The rise of Large Language Models (LLMs) is reshaping multimodal models, with speech synthesis being a prominent application. However, existing approaches often underutilize the linguistic intelligence of these models, typically failing to leverage their powerful instruction-following capabilities. This limitation hinders the model's ability to follow text instructions for controllable Text-to-Speech (TTS). To address this, we propose a new paradigm inspired by "operationalism" that decouples instruction understanding from speech generation. We introduce BatonVoice, a framework where an LLM acts as a "conductor", understanding user instructions and generating a textual "plan" of explicit vocal features (e.g., pitch, energy). A separate TTS model, the "orchestra", then generates the speech from these features. To realize this component, we develop BatonTTS, a TTS model trained specifically for this task. Our experiments demonstrate that BatonVoice achieves strong performance in controllable and emotional speech synthesis, outperforming strong open- and closed-source baselines. Notably, our approach enables remarkable zero-shot cross-lingual generalization, accurately applying feature control abilities to languages unseen during post-training. This demonstrates that objectifying speech into textual vocal features can more effectively unlock the linguistic intelligence of LLMs.

Paper Structure

This paper contains 32 sections, 3 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Illustration of BatonVoice: (1) An LLM, acting as a conductor, interprets the user's instructions and generates explicit vocal features. (2) These features are then fed into the BatonTTS model, the orchestra, which synthesizes the final speech. This separation allows the LLM to leverage its linguistic intelligence to guide the synthesis process, enabling controllable TTS.
  • Figure 2: Overview of the SFT stage of the BatonTTS framework. We extract vocal features from speech and verbalize them into a textual format.
  • Figure 3: Analysis of the key components of the proposed BatonVoice.
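
The verbalization step described in the Figure 2 caption (vocal features extracted from speech, then written out as text) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Segment` layout, the word-level granularity, the field names `pitch_mean` and `energy_mean`, and the JSON encoding are hypothetical and are not taken from the paper, whose actual plan format is not given in this summary.

```python
import json
from dataclasses import dataclass
from statistics import mean


@dataclass
class Segment:
    """One word-level span with frame-level measurements (hypothetical layout)."""
    word: str
    pitch_hz: list[float]  # F0 values for the frames of this word
    energy: list[float]    # energy values for the frames of this word


def verbalize(segments: list[Segment]) -> str:
    """Summarize numeric vocal features as a JSON string (assumed text format)."""
    plan = [
        {
            "word": seg.word,
            "pitch_mean": round(mean(seg.pitch_hz)),      # nearest whole Hz
            "energy_mean": round(mean(seg.energy), 2),    # two decimal places
        }
        for seg in segments
    ]
    return json.dumps(plan)


def parse_plan(text: str) -> list[dict]:
    """Inverse step: recover the structured plan from its textual form."""
    return json.loads(text)
```

Calling `verbalize` on a list of `Segment` objects yields a plain string that a language model could read or emit, and `parse_plan` recovers the structured values on the synthesis side. The point of the sketch is only that the interface between conductor and orchestra is ordinary text.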