A Detailed Audio-Text Data Simulation Pipeline using Single-Event Sounds

Xuenan Xu; Xiaohang Xu; Zeyu Xie; Pingyue Zhang; Mengyue Wu; Kai Yu

TL;DR

This paper first analyzes the detailed information that human descriptions of audio may contain beyond sound event labels, and proposes an automatic pipeline for curating audio-text pairs with rich details in text descriptions.

Abstract

Recently, there has been an increasing focus on audio-text cross-modal learning. However, most of the existing audio-text datasets contain only simple descriptions of sound events. Compared with classification labels, the advantages of such descriptions are significantly limited. In this paper, we first analyze the detailed information that human descriptions of audio may contain beyond sound event labels. Based on the analysis, we propose an automatic pipeline for curating audio-text pairs with rich details. Leveraging the property that sounds can be mixed and concatenated in the time domain, we control details in four aspects: temporal relationship, loudness, speaker identity, and occurrence number, in simulating audio mixtures. Corresponding details are transformed into captions by large language models. Audio-text pairs with rich details in text descriptions are thereby obtained. We validate the effectiveness of our pipeline with a small amount of simulated data, demonstrating that the simulated data enables models to learn detailed audio captioning.
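The mixing step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`scale_to_db`, `mix_events`, `temporal_relation`), the metadata fields, and the choice of RMS level in dB relative to full scale are all assumptions made for illustration. It shows how single-event waveforms might be placed on a timeline at chosen loudness levels while recording the details (onset, offset, level) that a language model could later turn into a caption.

```python
import math

def rms(wave):
    """Root-mean-square level of a waveform given as a list of floats."""
    return math.sqrt(sum(x * x for x in wave) / len(wave))

def scale_to_db(wave, target_db):
    """Scale a waveform so its RMS equals target_db (assumed dB re. full scale)."""
    level = rms(wave)
    if level == 0.0:
        return list(wave)  # silent input: nothing to scale
    gain = 10 ** (target_db / 20) / level
    return [x * gain for x in wave]

def mix_events(events, total_len):
    """Add events into one track; each event is (name, wave, onset, level_db).

    Returns the mixture and per-event metadata (hypothetical fields).
    """
    mix = [0.0] * total_len
    meta = []
    for name, wave, onset, level_db in events:
        scaled = scale_to_db(wave, level_db)
        end = min(onset + len(scaled), total_len)  # truncate at track end
        for i in range(onset, end):
            mix[i] += scaled[i - onset]
        meta.append({"event": name, "onset": onset,
                     "offset": end, "level_db": level_db})
    return mix, meta

def temporal_relation(a, b):
    """Describe whether two events are sequential or overlapping."""
    if a["offset"] <= b["onset"]:
        return f'{a["event"]} then {b["event"]}'
    if b["offset"] <= a["onset"]:
        return f'{b["event"]} then {a["event"]}'
    return f'{a["event"]} and {b["event"]} simultaneously'

# Toy waveforms standing in for curated single-event sounds.
dog = [math.sin(0.1 * i) for i in range(100)]
rain = [math.sin(0.3 * i) for i in range(100)]
mix, meta = mix_events([("dog", dog, 0, -20.0), ("rain", rain, 150, -30.0)], 300)
print(temporal_relation(meta[0], meta[1]))
```

In this sketch, occurrence number could be controlled by adding the same event several times with different onsets, and the metadata list is what would be passed to a language model for caption generation.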

Paper Structure (14 sections, 2 figures, 4 tables)

Introduction
Auditory Detail Taxonomy based on Human Perception
Sound Event Categories from Text Clustering
Details in Sound Descriptions
Detailed Audio-Text Simulation Pipeline
Single-event Sound Curation
Audio-Text Simulation Pipeline
Experimental Setup
Data Simulation
Hyper-parameters
Results
Objective Metrics
Human Evaluation
Conclusion

Figures (2)

Figure 1: The pipeline of single-event sound collection from Freesound.
Figure 2: The pipeline of simulating audio-text pairs that are rich in details using automatically-curated single sound sources and large language models.
