PicoAudio: Enabling Precise Timestamp and Frequency Controllability of Audio Events in Text-to-audio Generation

Zeyu Xie, Xuenan Xu, Zhizheng Wu, Mengyue Wu

TL;DR

PicoAudio targets precise temporal control in text-to-audio generation. It combines a data simulation pipeline, LLM-assisted text transformation, a VAE-based audio representation, and a diffusion model conditioned on a timestamp matrix and event embeddings. The approach provides timestamp control at 40 ms resolution and occurrence-frequency control via generated timestamp captions. It outperforms mainstream baselines on the objective metrics F1_segment and L1_freq and on subjective MOS. A key strength is the data construction pipeline, which supplies temporally aligned audio-text pairs so that the diffusion model can learn tight temporal associations; GPT-4 further extends control to arbitrary temporal expressions. The results suggest practical applicability to temporally precise audio content creation and point to scaling to more events and richer temporal relations as future work.
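
To make the timestamp-matrix conditioning concrete, the sketch below builds a binary class-by-frame matrix at the 40 ms resolution mentioned above. It is an illustration only: the binary encoding, the helper names `to_ms` and `timestamp_matrix`, the event labels, and the 10-second clip length are assumptions, not details confirmed by this summary.

```python
FRAME_MS = 40  # frame length implied by the 40 ms resolution in the TL;DR


def to_ms(seconds):
    """Convert seconds to integer milliseconds to avoid float drift."""
    return int(round(seconds * 1000))


def timestamp_matrix(events, classes, duration_sec):
    """Build a binary class-by-frame matrix from (label, onset, offset) tuples.

    Assumed encoding: entry [c][t] is 1 if class c is active in frame t.
    """
    n_frames = -(-to_ms(duration_sec) // FRAME_MS)  # ceiling division
    matrix = [[0] * n_frames for _ in classes]
    row = {label: i for i, label in enumerate(classes)}
    for label, onset, offset in events:
        start = to_ms(onset) // FRAME_MS
        end = min(n_frames, -(-to_ms(offset) // FRAME_MS))
        for t in range(start, end):
            matrix[row[label]][t] = 1
    return matrix


# Hypothetical example: two event classes in a 10-second clip.
classes = ["dog barking", "door slamming"]
events = [
    ("dog barking", 0.0, 1.0),
    ("dog barking", 2.0, 2.5),
    ("door slamming", 4.0, 4.2),
]
m = timestamp_matrix(events, classes, duration_sec=10.0)
print(len(m), len(m[0]))  # classes x frames
```

Working in integer milliseconds keeps frame boundaries exact; a partially covered frame is marked active here, which is one possible convention among several.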

Abstract

Recently, audio generation tasks have attracted considerable research interest. Precise temporal controllability is essential for integrating audio generation into real applications. In this work, we propose a temporally controlled audio generation framework, PicoAudio. PicoAudio integrates temporal information to guide audio generation through tailored model design. It leverages data crawling, segmentation, filtering, and simulation of fine-grained temporally-aligned audio-text data. Both subjective and objective evaluations demonstrate that PicoAudio dramatically surpasses current state-of-the-art generation models in terms of timestamp and occurrence frequency controllability. The generated samples are available on the demo website https://zeyuxie29.github.io/PicoAudio.github.io.

Paper Structure (15 sections, 5 equations, 2 figures, 1 table)

This paper contains 15 sections, 5 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: Illustration of controlling the timestamp / occurrence frequency of audio events with PicoAudio. It enables precise control of single or multiple events.
  • Figure 2: PicoAudio Flowchart. (Left) illustrates the simulation pipeline, wherein data is crawled from the Internet, segmented and filtered, resulting in one-occurrence segments stored in a database. Pairs of audio, timestamp captions, and frequency captions are simulated from the database. (Right) showcases the model framework. Red arrows indicate the model training process by using the simulated data. Blue arrows indicate inference based on timestamp or frequency captions, where the LLM is prompted with the simulated training data.
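
The simulation step described in the Figure 2 caption can be sketched roughly as follows: one-occurrence segments are drawn from a database, placed on a timeline without overlap, and described by a timestamp caption and a frequency caption. This is a minimal sketch under stated assumptions. The function `simulate_pair`, the caption wording and separators, the 10-second clip length, and the use of segment durations as stand-ins for real audio are all hypothetical; the actual audio mixing is omitted.

```python
import random


def simulate_pair(segment_db, event, n_occurrences, clip_sec=10.0, seed=0):
    """Place n one-occurrence segments of `event` without overlap and caption them.

    `segment_db` maps an event label to a list of segment durations in seconds
    (a stand-in for stored audio segments).
    """
    rng = random.Random(seed)
    durations = [rng.choice(segment_db[event]) for _ in range(n_occurrences)]
    free = clip_sec - sum(durations)
    if free < 0:
        raise ValueError("segments do not fit in the clip")
    # Sorted random cut points in the free time; shifting each cut by the
    # durations already placed guarantees that segments never overlap.
    cuts = sorted(rng.uniform(0, free) for _ in range(n_occurrences))
    intervals, shift = [], 0.0
    for cut, dur in zip(cuts, durations):
        onset = cut + shift
        intervals.append((round(onset, 3), round(onset + dur, 3)))
        shift += dur
    # Caption formats below are assumed for illustration only.
    spans = "_".join(f"{a}-{b}" for a, b in intervals)
    return {
        "intervals": intervals,
        "timestamp_caption": f"{event} at {spans}",
        "frequency_caption": f"{event} {n_occurrences} times",
    }


# Hypothetical database with three stored "dog barking" segment durations.
pair = simulate_pair({"dog barking": [0.5, 0.8, 1.2]}, "dog barking", 3)
print(pair["timestamp_caption"])
print(pair["frequency_caption"])
```

Because both captions are derived from the same placement, each simulated clip comes with mutually consistent timestamp and frequency descriptions.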