Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning
Luoyi Sun, Xuenan Xu, Mengyue Wu, Weidi Xie
TL;DR
Auto-ACD introduces a scalable automatic pipeline that fuses audio, visual, and environmental cues to generate a large-scale audio-text dataset with rich environmental information. By leveraging off-the-shelf vision, language, and audio models and an LLM for paraphrasing, the approach yields over 1.5M audio-text pairs with long, diverse captions, while filtering improves cross-modal coherence. The authors validate the dataset through audio-language retrieval, audio captioning, and zero-shot environment classification, showing improvements over prior datasets and strong environmental understanding. This data-centric contribution enables more robust audio-language representations and provides a replicable blueprint for future large-scale multimodal datasets and benchmarks.
Abstract
Recently, the AI community has made significant strides in developing powerful foundation models, driven by large-scale multimodal datasets. However, for audio representation learning, existing datasets suffer from three limitations: insufficient volume, simplistic content, and arduous collection procedures. To establish an audio dataset with high-quality captions, we propose an innovative, automatic approach that leverages multimodal inputs, such as video frames and audio streams. Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs. We exploit a series of pre-trained models or APIs to determine audio-visual synchronisation and to generate image captions, object detections, and audio tags for specific videos. Subsequently, we employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multi-modality clues. To demonstrate the effectiveness of the proposed dataset, we train widely used models on it and show performance improvements on various downstream tasks, for example, audio-language retrieval, audio captioning, and zero-shot classification. In addition, we establish a novel benchmark with environmental information for audio-text tasks.
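The caption-generation step described in the abstract (extract clues with pre-trained models, then ask an LLM to paraphrase them into one caption) can be illustrated with a minimal sketch. This is not the authors' code: the `Clues` container, the `build_prompt` function, and the prompt wording are all hypothetical, and the call to an actual LLM is deliberately left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Clues:
    """Hypothetical container for cues extracted from one video clip."""
    image_caption: str = ""
    detected_objects: List[str] = field(default_factory=list)
    audio_tags: List[str] = field(default_factory=list)


def build_prompt(clues: Clues) -> str:
    """Assemble an LLM prompt from whichever clues are available.

    The instruction text is an assumption made for illustration; the
    abstract does not specify the prompt.
    """
    lines = ["Write one sentence describing the audio, using only the clues below."]
    if clues.image_caption:
        lines.append(f"Image caption: {clues.image_caption}")
    if clues.detected_objects:
        lines.append("Detected objects: " + ", ".join(clues.detected_objects))
    if clues.audio_tags:
        lines.append("Audio tags: " + ", ".join(clues.audio_tags))
    return "\n".join(lines)
```

For example, `build_prompt(Clues(image_caption="a train at a station", audio_tags=["train horn"]))` produces a three-line prompt that omits the empty object list; the resulting string would then be passed to whichever LLM is used for paraphrasing.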
