VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling

Yixuan Zhou, Xiaoyu Qin, Zeyu Jin, Shuoyi Zhou, Shun Lei, Songtao Zhou, Zhiyong Wu, Jia Jia

TL;DR

VoxInstruct is a unified multilingual codec language modeling framework that generates expressive speech directly from a natural language instruction $x_{ins}$, which combines content and style in a single prompt; a speech prompt can optionally be supplied for voice cloning. It introduces speech semantic tokens (ST) as an intermediate content guide and applies multiple classifier-free guidance (CFG) strategies within a LLaMA-based autoregressive/non-autoregressive codec LM, paired with an MT5-based multilingual text encoder and a Vocos decoder. Training follows a pre-training/fine-tuning pipeline: first on transcript-only data for quality and generalization, then on instruction-bearing data for fine-grained control over speech attributes, including cross-lingual and code-switched outputs. In the reported experiments, VoxInstruct performs strongly on instruction-to-speech tasks, controls stress robustly, and is competitive in zero-shot voice cloning, outperforming baselines on objective metrics (WER, MCD, SECS) and subjective MOS scores; ablations support the contributions of the semantic tokens and CFG. Overall, the work aligns speech generation with other AIGC modalities and enables expressive, multilingual synthesis that can combine a speech prompt with a descriptive instruction.

Abstract

Recent AIGC systems can generate digital multimedia content, such as text, images, and video, from human language instructions. For speech, however, existing human instruction-to-speech methods have two limitations. First, they require the input to be divided into a content prompt (transcript) and a description prompt (style and speaker) instead of directly supporting human instructions. This division is less natural in form and does not align with other AIGC models. Second, modeling speech style with an independent description prompt, without considering the transcript content, restricts fine-grained control over the speech. To address these limitations, we propose VoxInstruct, a novel unified multilingual codec language modeling framework that extends traditional text-to-speech tasks into a general human instruction-to-speech task. Our approach enhances the expressiveness of human instruction-guided speech generation and aligns the speech generation paradigm with other modalities. To enable the model to automatically extract the content of the synthesized speech from raw text instructions, we introduce speech semantic tokens as an intermediate representation for instruction-to-content guidance. We also incorporate multiple Classifier-Free Guidance (CFG) strategies into our codec language model, which strengthens how closely the generated speech follows human instructions. Furthermore, our model architecture and training strategies support combining a speech prompt with a descriptive human instruction for expressive speech synthesis, which is a first-of-its-kind attempt. Code, models, and demos are available at: https://github.com/thuhcsi/VoxInstruct.
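To make the CFG idea mentioned in the abstract concrete, the following is a minimal sketch of the generic classifier-free guidance rule applied to next-token logits of an autoregressive model. It is not the paper's exact set of guidance strategies, which are not described in this summary; the function names, the toy logits, and the `guidance_scale` value are all illustrative assumptions.

```python
import math


def cfg_logits(cond_logits, uncond_logits, guidance_scale):
    """Generic CFG: push logits away from the unconditional prediction.

    guided = uncond + scale * (cond - uncond)
    scale = 1.0 returns the conditional logits unchanged.
    (Hypothetical helper; not taken from the VoxInstruct code base.)
    """
    if len(cond_logits) != len(uncond_logits):
        raise ValueError("logit vectors must have the same length")
    return [u + guidance_scale * (c - u)
            for c, u in zip(cond_logits, uncond_logits)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: logits over a 3-token vocabulary (assumed values).
cond = [2.0, 0.5, -1.0]    # model output with the instruction as condition
uncond = [1.0, 1.0, 0.0]   # model output with the condition dropped
guided = cfg_logits(cond, uncond, guidance_scale=3.0)
probs = softmax(guided)
```

With a scale above 1, tokens that the instruction makes more likely gain probability mass relative to plain conditional sampling, which is the sense in which guidance strengthens instruction following.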

Paper Structure (19 sections, 6 equations, 3 figures, 6 tables)


Figures (3)

  • Figure 1: The capabilities of the proposed expressive human instruction-to-speech generation model.
  • Figure 2: Model architecture.
  • Figure 3: Mel-spectrograms, pitch, and energy contours of speech generated according to human instructions for 4 test cases are depicted. Each subplot is annotated with its respective instruction input. In cases (a) and (b), only the instruction text is provided, whereas cases (c) and (d) also include a speech prompt. The SECS between these cases and the speech prompt (if provided) is displayed in the top left corner.
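The SECS value reported in Figure 3 is a cosine similarity between speaker embeddings of the generated speech and of the speech prompt. As a rough sketch, assuming the embeddings have already been extracted by some speaker encoder (the encoder itself, the vector dimensionality, and the toy values below are assumptions, not details from the paper), the score could be computed like this:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Toy 4-dimensional "speaker embeddings" (hypothetical values; a real
# speaker encoder would produce much higher-dimensional vectors).
prompt_embedding = [0.2, 0.9, -0.1, 0.4]
generated_embedding = [0.25, 0.8, 0.0, 0.5]

secs = cosine_similarity(prompt_embedding, generated_embedding)
```

Scores closer to 1 indicate that the generated voice is closer to the prompt speaker in the embedding space of the chosen encoder.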