PicoAudio2: Temporal Controllable Text-to-Audio Generation with Natural Language Description

Zihao Zheng; Zeyu Xie; Xuenan Xu; Wen Wu; Chao Zhang; Mengyue Wu

TL;DR

The paper addresses the gap between high-quality audio synthesis and fine-grained temporal control in text-to-audio (TTA) generation. It introduces PicoAudio2, which combines temporally aligned data curation (simulated and real data) with a timestamp-augmented diffusion framework that uses a dedicated timestamp matrix to guide generation from free-text descriptions. The approach improves both temporal controllability and audio quality, as measured by objective metrics and human judgments, and ablations show that both the real data and the timestamp module are necessary. The work enables open-ended, language-driven TTA that aligns more closely with user descriptions, and it identifies overlapping events and scalable real-data annotation as directions for future work.

Abstract

Recent work in controllable text-to-audio (TTA) generation has achieved fine-grained control through timestamp conditioning, but its scope remains limited by audio quality and input format. Because these models rely solely on synthetic data, they often produce poor audio quality on real datasets. Moreover, some models are constrained to a closed vocabulary of sound events, which prevents them from controlling audio generation for open-ended, free-text queries. This paper introduces PicoAudio2, a framework that advances temporally controllable TTA by mitigating these data and architectural limitations. Specifically, we use a grounding model to annotate event timestamps in real audio-text datasets, curating temporally-strong real data in addition to simulation data from existing works. The model is trained on the combination of real and simulation data. We also propose an enhanced architecture that integrates fine-grained information from a timestamp matrix with coarse-grained free-text input. Experiments show that PicoAudio2 achieves superior temporal controllability and audio quality.
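To make the idea of a timestamp matrix concrete, the following is a minimal sketch of how event timestamps might be rasterized into a frame-level matrix. The abstract does not specify the matrix layout, so everything here is an assumption made for illustration: the function name `build_timestamp_matrix`, the binary event-by-frame layout, the `frame_rate_hz` parameter, and the example events are all hypothetical and may differ from what PicoAudio2 actually uses.

```python
import numpy as np


def build_timestamp_matrix(events, duration_s, frame_rate_hz=25):
    """Rasterize (description, onset_s, offset_s) tuples into a binary matrix.

    Assumed layout (not taken from the paper): one row per distinct event
    description, one column per time frame, 1.0 where the event is active.
    """
    n_frames = int(round(duration_s * frame_rate_hz))

    # Keep distinct free-text descriptions in order of first appearance.
    labels = []
    for description, _, _ in events:
        if description not in labels:
            labels.append(description)

    matrix = np.zeros((len(labels), n_frames), dtype=np.float32)
    for description, onset_s, offset_s in events:
        row = labels.index(description)
        # Clamp the frame range to the clip boundaries.
        start = max(0, int(np.floor(onset_s * frame_rate_hz)))
        end = min(n_frames, int(np.ceil(offset_s * frame_rate_hz)))
        matrix[row, start:end] = 1.0
    return labels, matrix


# Hypothetical annotations: two barks and one door slam that overlaps the first bark.
events = [
    ("a dog barks", 0.5, 2.0),
    ("a door slams", 1.5, 2.5),
    ("a dog barks", 6.0, 7.0),
]
labels, matrix = build_timestamp_matrix(events, duration_s=10.0, frame_rate_hz=10)
```

In a layout like this, overlapping events simply occupy the same columns in different rows. How such a matrix is combined with the free-text condition inside the diffusion model is not described in the abstract and is not sketched here.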
