ControlAudio: Tackling Text-Guided, Timing-Indicated and Intelligible Audio Generation via Progressive Diffusion Modeling

Yuxuan Jiang, Zehua Chen, Zeqian Ju, Yusheng Dai, Weibei Dou, Jun Zhu

TL;DR

ControlAudio reframes text-to-audio generation as a multi-task problem and introduces progressive diffusion modeling to jointly leverage text, timing, and phoneme information. It builds a unified conditioning pipeline via a Structured Prompt and augments data with annotated and simulated datasets, then trains the model in three progressive stages while applying progressively guided sampling to align coarse temporal structure with fine-grained speech content. Empirical results show state-of-the-art temporal accuracy and speech clarity across multiple benchmarks, with thorough ablations validating prompt design, vocabulary granularity, and sampling strategy. This approach enables scalable, controllable TTA while highlighting practical considerations for data quality and responsible deployment.

Abstract

Text-to-audio (TTA) generation with fine-grained control signals, e.g., precise timing control or intelligible speech content, has been explored in recent works. However, constrained by data scarcity, their generation performance at scale is still compromised. In this study, we recast controllable TTA generation as a multi-task learning problem and introduce a progressive diffusion modeling approach, ControlAudio. Our method adeptly fits distributions conditioned on more fine-grained information, including text, timing, and phoneme features, through a step-by-step strategy. First, we propose a data construction method spanning both annotation and simulation, augmenting condition information in the sequence of text, timing, and phoneme. Second, at the model training stage, we pretrain a diffusion transformer (DiT) on large-scale text-audio pairs, achieving scalable TTA generation, and then incrementally integrate the timing and phoneme features with unified semantic representations, expanding controllability. Finally, at the inference stage, we propose progressively guided generation, which sequentially emphasizes more fine-grained information, aligning inherently with the coarse-to-fine sampling nature of DiT. Extensive experiments show that ControlAudio achieves state-of-the-art performance in terms of temporal accuracy and speech clarity, significantly outperforming existing methods on both objective and subjective evaluations. Demo samples are available at: https://control-audio.github.io/Control-Audio.

Paper Structure

This paper contains 33 sections, 6 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: The end-to-end Progressive Diffusion Modeling of ControlAudio, which combines progressive model training with a progressively guided sampling process for decoupled control of temporal structure and speech content.
  • Figure 2: An illustrative example of a structured prompt.
  • Figure 3: Analysis of Progressive Sampling parameters ($w_{low}$, $w_{high}$). This study reveals a clear trade-off between audio quality and speech intelligibility.
  • Figure 4: Overview of our CoT-based LLM planning pipeline. Given a user-provided free-form caption, the LLM performs multi-step reasoning to extract audio events with their temporal spans, infer speech content when applicable, and generate a structured prompt that encodes both timing and content for controllable audio generation.
  • Figure 5: Gemini 2.5 Pro for Speech Annotation.
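
The abstract describes progressively guided generation only at a high level, and Figure 3 analyzes a pair of guidance parameters. As a rough illustration of the general idea, the sketch below pairs the standard classifier-free guidance combination with a two-phase guidance scale that changes partway through sampling. The function names, the `switch_frac` parameter, the scale names `w_early` and `w_late`, and the choice of a hard switch are all assumptions made for illustration; this page does not state how ControlAudio actually schedules its guidance.

```python
import numpy as np


def guidance_weight(step, num_steps, w_early, w_late, switch_frac=0.5):
    """Hypothetical two-phase schedule (an assumption, not the paper's rule).

    Returns w_early for the first `switch_frac` of the sampling steps
    (coarse structure) and w_late afterwards (fine-grained detail).
    """
    if not 0 <= step < num_steps:
        raise ValueError("step must satisfy 0 <= step < num_steps")
    return w_early if step < switch_frac * num_steps else w_late


def cfg_combine(eps_uncond, eps_cond, w):
    """Standard classifier-free guidance: eps_u + w * (eps_c - eps_u)."""
    eps_uncond = np.asarray(eps_uncond, dtype=float)
    eps_cond = np.asarray(eps_cond, dtype=float)
    return eps_uncond + w * (eps_cond - eps_uncond)


def guided_predictions(eps_uncond_seq, eps_cond_seq, w_early, w_late,
                       switch_frac=0.5):
    """Apply the hypothetical schedule to per-step model predictions.

    The two sequences stand in for the unconditional and conditional
    outputs of a denoiser at each sampling step (placeholders here).
    """
    num_steps = len(eps_uncond_seq)
    return [
        cfg_combine(
            eps_u,
            eps_c,
            guidance_weight(i, num_steps, w_early, w_late, switch_frac),
        )
        for i, (eps_u, eps_c) in enumerate(zip(eps_uncond_seq, eps_cond_seq))
    ]
```

With `w = 1` the combination reduces to the conditional prediction, and with `w = 0` to the unconditional one; larger values push the sample harder toward the condition. A smooth ramp between the two scales would be an equally plausible variant of this sketch.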