AudioTime: A Temporally-aligned Audio-text Benchmark Dataset

Zeyu Xie, Xuenan Xu, Zhizheng Wu, Mengyue Wu

TL;DR

The paper addresses the lack of temporally aligned audio-text datasets, which limits fine-grained temporal control in text-to-audio generation. It introduces AudioTime, a dataset built by a fully automated pipeline that curates clean, non-overlapping sound segments, simulates temporally structured audio clips with Scaper, and generates temporally rich captions with agentic LLMs. It also proposes STEAM, a text-based evaluation metric that uses an audio-text grounding model to quantify control over ordering, duration, frequency, and timestamps. AudioTime covers these four temporal signals with 5000 training and 500 test instances per signal. Experiments show that current models struggle with precise temporal control, while LLM-assisted approaches such as Make-An-Audio2 improve results by structuring free-text prompts into clearer temporal guidance.

Abstract

Recent advancements in audio generation have enabled the creation of high-fidelity audio clips from free-form textual descriptions. However, temporal relationships, a critical feature for audio content, are currently underrepresented in mainstream models, resulting in imprecise temporal controllability. Specifically, users cannot accurately control the timestamps of sound events using free-form text. We acknowledge that a significant factor is the absence of high-quality, temporally-aligned audio-text datasets, which are essential for training models with temporal control. The more temporally-aligned the annotations, the better the models can understand the precise relationship between audio outputs and temporal textual prompts. Therefore, we present a strongly aligned audio-text dataset, AudioTime. It provides text annotations rich in temporal information such as timestamps, duration, frequency, and ordering, covering almost all aspects of temporal control. Additionally, we offer a comprehensive test set and evaluation metric to assess the temporal control performance of various models. Examples are available at https://zeyuxie29.github.io/AudioTime/
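
To make the four temporal signals concrete, the following is a minimal sketch of how ordering, frequency, duration, and timestamp control could each be checked once a grounding model has returned detected segments. It is not the paper's STEAM definition: the `Segment` type, the function names, and the choice of absolute error and intersection-over-union as scores are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    """One detected sound event (hypothetical output of a grounding model)."""
    label: str
    onset: float   # seconds
    offset: float  # seconds


def segments_for(detected: List[Segment], label: str) -> List[Segment]:
    """Return the segments of one label, sorted by onset."""
    return sorted((s for s in detected if s.label == label), key=lambda s: s.onset)


def ordering_ok(detected: List[Segment], first: str, second: str) -> bool:
    """Ordering: does `first` start before `second`? False if either is missing."""
    a, b = segments_for(detected, first), segments_for(detected, second)
    return bool(a and b) and a[0].onset < b[0].onset


def frequency_error(detected: List[Segment], label: str, expected_count: int) -> int:
    """Frequency: absolute difference between detected and requested occurrences."""
    return abs(len(segments_for(detected, label)) - expected_count)


def duration_error(detected: List[Segment], label: str, expected_seconds: float) -> float:
    """Duration: absolute difference between total detected and requested seconds."""
    total = sum(s.offset - s.onset for s in segments_for(detected, label))
    return abs(total - expected_seconds)


def timestamp_iou(detected: List[Segment], label: str, onset: float, offset: float) -> float:
    """Timestamp: best intersection-over-union between a detection and the target span."""
    best = 0.0
    for s in segments_for(detected, label):
        inter = max(0.0, min(s.offset, offset) - max(s.onset, onset))
        union = (s.offset - s.onset) + (offset - onset) - inter
        if union > 0:
            best = max(best, inter / union)
    return best


# Illustrative detections, not data from the paper.
detected = [
    Segment("dog bark", 1.0, 2.0),
    Segment("dog bark", 4.0, 5.0),
    Segment("door slam", 6.0, 7.0),
]
print(ordering_ok(detected, "dog bark", "door slam"))
print(frequency_error(detected, "dog bark", 2))
print(duration_error(detected, "dog bark", 2.5))
print(timestamp_iou(detected, "door slam", 6.5, 7.5))
```

Each function takes the requested control signal as an argument, so a prompt such as "a dog barks twice" maps to `frequency_error(detected, "dog bark", 2)`.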

Paper Structure (11 sections, 1 equation, 2 figures, 3 tables, 1 algorithm)

Figures (2)

  • Figure 1: Temporally-aligned audio-text samples in AudioTime.
  • Figure 2: Construction pipeline: (1) Acquire clean segments from AudioSet-Strong and filter them using CLAP and ATG models; (2) Simulate audio clips with the Scaper tool and record metadata; (3) Generate captions using two agentic LLMs.
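
Steps (2) and (3) of the pipeline in Figure 2 can be illustrated with a small stand-in. The sketch below does not call Scaper or an LLM: it only places events on a timeline without overlap, records their onset and offset as metadata, and renders a templated timestamp caption. The function names, the 10-second clip length, the rejection-sampling placement, and the caption template are all assumptions, not the authors' implementation.

```python
import random
from typing import Dict, List, Tuple

Event = Tuple[str, float, float]  # (label, onset_seconds, offset_seconds)


def place_events(
    durations: Dict[str, float],
    clip_length: float = 10.0,
    seed: int = 0,
    max_tries: int = 1000,
) -> List[Event]:
    """Place each event at a random onset so that no two events overlap."""
    rng = random.Random(seed)
    placed: List[Event] = []
    for label, dur in durations.items():
        if dur > clip_length:
            raise ValueError(f"{label!r} is longer than the clip")
        for _ in range(max_tries):
            onset = rng.uniform(0.0, clip_length - dur)
            offset = onset + dur
            # Accept the candidate only if it is disjoint from every placed event.
            if all(offset <= o or onset >= f for _, o, f in placed):
                placed.append((label, onset, offset))
                break
        else:
            raise RuntimeError(f"could not place {label!r} without overlap")
    return sorted(placed, key=lambda e: e[1])


def timestamp_caption(events: List[Event]) -> str:
    """Render the recorded metadata as a simple templated caption."""
    return ", ".join(
        f"{label} from {onset:.2f} to {offset:.2f} seconds"
        for label, onset, offset in events
    )


# Illustrative event durations in seconds, not values from the dataset.
events = place_events({"dog bark": 1.5, "door slam": 0.5, "car horn": 2.0})
print(timestamp_caption(events))
```

In the real pipeline the metadata would come from the simulation tool and the caption from the LLMs; the template here only shows how recorded onsets and offsets can carry over into text.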