PicoAudio2: Temporal Controllable Text-to-Audio Generation with Natural Language Description

Zihao Zheng, Zeyu Xie, Xuenan Xu, Wen Wu, Chao Zhang, Mengyue Wu

TL;DR

The paper addresses the gap between high-quality audio synthesis and fine-grained temporal control in text-to-audio generation. It introduces PicoAudio2, which combines temporally-aligned data curation (simulation and real data) with a timestamp-augmented diffusion framework that uses a dedicated timestamp matrix to guide generation from free-text descriptions. The approach yields improved temporal controllability and audio quality, validated by objective metrics and human judgments, with ablations showing the necessity of real data and the timestamp module. This work enables open-ended, language-driven TTA with better alignment to user descriptions, while highlighting future work on overlapping events and scalable real-data annotation.

Abstract

While recent work in controllable text-to-audio (TTA) generation has achieved fine-grained control through timestamp conditioning, its scope remains limited by audio quality and input format. These models often suffer from poor audio quality in real datasets due to sole reliance on synthetic data. Moreover, some models are constrained to a closed vocabulary of sound events, preventing them from controlling audio generation for open-ended, free-text queries. This paper introduces PicoAudio2, a framework that advances temporal-controllable TTA by mitigating these data and architectural limitations. Specifically, we use a grounding model to annotate event timestamps of real audio-text datasets to curate temporally-strong real data, in addition to simulation data from existing works. The model is trained on the combination of real and simulation data. Moreover, we propose an enhanced architecture that integrates the fine-grained information from a timestamp matrix with coarse-grained free-text input. Experiments show that PicoAudio2 exhibits superior performance in terms of temporal controllability and audio quality.

Paper Structure

This paper contains 12 sections, 4 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: The data curation pipeline. The left part shows the real dataset processing pipeline, where the TAG model extracts event timestamps and data with omissions or overlaps are excluded. The right part shows the data simulation pipeline, where multi-event audio is simulated from preprocessed single-event segments with precise timestamp information. Captions are obtained by concatenating single-event descriptions.
  • Figure 2: PicoAudio2 framework. The red arrows represent the training process, while the blue arrows represent inference. During inference, users can either provide detailed timestamps for each event or give a coarse description from which an LLM infers the timestamp information.
  • Figure 3: During inference, users can provide a TDC, as in (a), or a TCC, which is transformed into a TDC, as in (b) and (c).
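
To make the captions for Figures 1 and 2 more concrete, the following is a minimal sketch of the simulation idea: single-event segments are placed on a silent timeline without overlap, their onsets and offsets are recorded, and a caption is built by concatenating the single-event descriptions. It is not the authors' implementation. The `Segment` class, the function names, the comma used to join descriptions, the random gap placement, and the binary event-by-frame layout of the timestamp matrix are all assumptions made for illustration; the actual matrix used in the paper may be structured differently.

```python
import random
from dataclasses import dataclass


@dataclass
class Segment:
    """A single-event clip (hypothetical structure, not from the paper)."""
    description: str  # free-text description of the event
    samples: list     # mono waveform as a list of floats


def simulate_clip(segments, total_sec=10.0, sample_rate=16000, seed=0):
    """Place segments in order on a silent timeline with no overlap.

    Returns (mixture, events, caption); events is a list of
    (description, onset_sec, offset_sec) tuples.
    """
    rng = random.Random(seed)
    total = round(total_sec * sample_rate)
    used = sum(len(s.samples) for s in segments)
    if used > total:
        raise ValueError("segments do not fit in the clip")
    # Sorted random cut points split the spare silence into the gaps
    # that precede each segment (assumed placement strategy).
    spare = total - used
    cuts = sorted(rng.randint(0, spare) for _ in segments)
    mixture = [0.0] * total
    events, cursor, prev_cut = [], 0, 0
    for seg, cut in zip(segments, cuts):
        cursor += cut - prev_cut  # leading silence before this segment
        prev_cut = cut
        end = cursor + len(seg.samples)
        mixture[cursor:end] = seg.samples
        events.append((seg.description, cursor / sample_rate, end / sample_rate))
        cursor = end
    # Caption by concatenating single-event descriptions (joiner is assumed).
    caption = ", ".join(s.description for s in segments)
    return mixture, events, caption


def timestamp_matrix(events, total_sec=10.0, frames_per_sec=25):
    """Binary matrix with one row per event and one column per frame.

    An entry is 1 while the event is active (assumed layout).
    """
    n_frames = round(total_sec * frames_per_sec)
    matrix = []
    for _, onset, offset in events:
        start = min(n_frames, round(onset * frames_per_sec))
        # Mark at least one frame so very short events are not dropped.
        stop = min(n_frames, max(start + 1, round(offset * frames_per_sec)))
        matrix.append([1 if start <= t < stop else 0 for t in range(n_frames)])
    return matrix
```

In this sketch, the `events` list returned by `simulate_clip` can be passed directly to `timestamp_matrix`, so every simulated clip comes with a caption and frame-level timing. The same function would accept timestamps that a grounding model extracts from real audio, or that an LLM infers from a coarse description.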