InstructAudio: Unified speech and music generation with natural language instruction

Chunyu Qiang, Kang Yin, Xiaopeng Wang, Yuzhe Liang, Jiahui Zhao, Ruibo Fu, Tianrui Wang, Cheng Gong, Chen Zhang, Longbiao Wang, Jianwu Dang

TL;DR

This paper addresses the lack of a unified, instruction-driven approach for both speech and music generation. It introduces InstructAudio, a multimodal diffusion transformer (MM-DiT) framework with a Latent Audio Codec and a standardized instruction-phoneme input that supports text-based control over timbre, paralinguistics, and musical attributes, and that enables dialogue generation in English and Chinese. Trained on 50K hours of speech and 20K hours of music, the model achieves state-of-the-art instruction-based TTS performance and competitive music generation, all without relying on reference audio for conditioning. The work demonstrates the feasibility of joint TTS and TTM modeling with unified inputs, offering significant potential for cross-modal audio synthesis and more flexible conversational AI applications.

Abstract

Text-to-speech (TTS) and text-to-music (TTM) models face significant limitations in instruction-based control. TTS systems usually depend on reference audio for timbre, offer only limited text-level attribute control, and rarely support dialogue generation. TTM systems are constrained by input conditioning requirements that depend on expert knowledge annotations. The high heterogeneity of these input control conditions makes them difficult to model jointly with speech synthesis. Despite sharing common acoustic modeling characteristics, these two tasks have long been developed independently, leaving open the challenge of achieving unified modeling through natural language instructions. We introduce InstructAudio, a unified framework that enables instruction-based control (via natural language descriptions) of acoustic attributes, including timbre (gender, age), paralinguistic attributes (emotion, style, accent), and musical attributes (genre, instrument, rhythm, atmosphere). It supports expressive speech, music, and dialogue generation in English and Chinese. The model employs joint and single diffusion transformer layers with a standardized instruction-phoneme input format, trained on 50K hours of speech and 20K hours of music data, enabling multi-task learning and cross-modal alignment. Fig. 1 visualizes performance comparisons with mainstream TTS and TTM models, demonstrating that InstructAudio achieves the best results on most metrics. To the best of our knowledge, InstructAudio represents the first instruction-controlled framework unifying speech and music generation. Audio samples are available at: https://qiangchunyu.github.io/InstructAudio/

Paper Structure

This paper contains 12 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Comparing model capabilities across TTS and TTM tasks. The chart shows normalized performance on 13 metrics: SeedTTS-WER [anastassiou2024seed], TTS-Control, and SongEval [yao2025songeval]. InstructAudio (red line) uniquely supports all evaluation dimensions, demonstrating the best performance in both TTS and TTM while providing comprehensive controllability across multiple attributes.
  • Figure 2: InstructAudio achieves unified generation of both speech and music through an MM-DiT architecture. This framework enables multi-attribute control through natural language instructions. The input format remains consistent across different tasks, comprising a natural language instruction description along with corresponding text or lyrics. Audio is represented using continuous latents extracted from a pre-trained Mel-VAE. During inference, the VAE latent of the target speech or music is obtained through an ODE solver.
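
The Figure 2 caption says the target VAE latent is obtained through an ODE solver at inference time, but it does not name the solver, the step count, or the time convention. The sketch below is only an illustration under stated assumptions: a fixed-step Euler solver, time running from t=0 (noise) to t=1 (latent), and a toy velocity field standing in for the trained MM-DiT. The names `euler_ode_sample` and `make_toy_velocity` are hypothetical and not taken from the paper.

```python
import numpy as np


def euler_ode_sample(velocity_fn, x0, num_steps=32):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler.

    Assumption: the real system would call the trained network here,
    conditioned on the instruction and phoneme inputs.
    """
    x = np.array(x0, dtype=np.float64)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the toy field never divides by zero
        x = x + dt * velocity_fn(x, t)
    return x


def make_toy_velocity(target):
    """Hypothetical stand-in for the network: a field pointing straight at a fixed latent."""
    def velocity_fn(x, t):
        # Velocity of the straight-line path that reaches `target` at t=1.
        return (target - x) / (1.0 - t)
    return velocity_fn


rng = np.random.default_rng(0)
target_latent = rng.standard_normal((4, 8))  # pretend latent: 4 frames x 8 channels
noise = rng.standard_normal((4, 8))          # starting point of the ODE trajectory
sample = euler_ode_sample(make_toy_velocity(target_latent), noise, num_steps=16)
```

With this toy field the Euler trajectory lands on `target_latent`; in a real pipeline the result would instead be passed to the VAE decoder to recover audio features, and a higher-order solver or a different step count could be substituted without changing the interface.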