Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning

Luoyi Sun, Xuenan Xu, Mengyue Wu, Weidi Xie

TL;DR

Auto-ACD introduces a scalable automatic pipeline that fuses audio, visual, and environmental cues to generate a large-scale audio-text dataset with rich environmental information. By leveraging off-the-shelf vision, language, and audio models and an LLM for paraphrasing, the approach yields over 1.5M audio-text pairs with long, diverse captions, and a filtering step improves cross-modal coherence. The authors validate the dataset on audio-language retrieval, automatic audio captioning, and zero-shot environment classification, reporting improvements over prior datasets and strong environmental understanding. This data-centric contribution supports more robust audio-language representations and offers a replicable blueprint for future large-scale multimodal datasets and benchmarks.

Abstract

Recently, the AI community has made significant strides in developing powerful foundation models, driven by large-scale multimodal datasets. However, for audio representation learning, existing datasets suffer from three limitations: insufficient volume, simplistic content, and arduous collection procedures. To establish an audio dataset with high-quality captions, we propose an innovative, automatic approach leveraging multimodal inputs, such as video frames and audio streams. Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs. We exploit a series of pre-trained models or APIs to determine audio-visual synchronisation and to generate image captions, object detections, and audio tags for specific videos. Subsequently, we employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multi-modality clues. To demonstrate the effectiveness of the proposed dataset, we train widely used models on it and show performance improvements on various downstream tasks, for example, audio-language retrieval, audio captioning, and zero-shot classification. In addition, we establish a novel benchmark with environmental information and provide a benchmark for audio-text tasks.
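
The caption-generation step described above, in which extracted clues guide an LLM paraphrase, can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: the `Clues` fields, the `build_prompt` helper, and the instruction wording are hypothetical, and the prompt the authors actually used (shown in Figure 3 of the paper) is not reproduced.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Clues:
    """Per-clip clues. Field names are illustrative, not the paper's schema."""
    image_caption: str = ""
    detected_objects: list[str] = field(default_factory=list)
    audio_tags: list[str] = field(default_factory=list)
    original_labels: list[str] = field(default_factory=list)


def build_prompt(clues: Clues) -> str:
    """Assemble a paraphrasing prompt from whichever clues are present."""
    # Hypothetical instruction text, not the authors' prompt.
    lines = ["Write one sentence describing what can be heard in this clip."]
    if clues.image_caption:
        lines.append(f"Image caption: {clues.image_caption}")
    if clues.detected_objects:
        lines.append("Detected objects: " + ", ".join(clues.detected_objects))
    if clues.audio_tags:
        lines.append("Audio tags: " + ", ".join(clues.audio_tags))
    if clues.original_labels:
        lines.append("Dataset labels: " + ", ".join(clues.original_labels))
    return "\n".join(lines)


# Made-up example values, used only to show the shape of the prompt.
example = Clues(
    image_caption="a train pulling into a station",
    detected_objects=["train", "platform"],
    audio_tags=["rail transport", "horn"],
    original_labels=["Train"],
)
prompt = build_prompt(example)
print(prompt)
```

The resulting string would then be sent to whichever LLM is used for paraphrasing; that call is left out because it depends on the provider.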

Paper Structure (49 sections, 4 equations, 7 figures, 9 tables)

Figures (7)

  • Figure 1: Comparison with other audio caption datasets. "Length" and "# Vocab." refer to average caption length and vocabulary size. "Env." and "Auto." refer to environmental information and automatic pipeline, respectively.
  • Figure 2: Automatic pipeline for Auto-ACD collection. We utilize four open-source computer vision models to extract visual clues from the middle frame of each video, and two open-source audio understanding models to analyze the entire audio content. We then combine these clues with the labels from the original dataset and use Large Language Models (LLMs) to interpret and paraphrase them into the final description.
  • Figure 3: Detailed prompt provided to ChatGPT. For visualisation purposes, we use different colors to highlight diverse visual-audio cues.
  • Figure 4: Filtering process for AudioSet. We filter the dataset by assessing whether the video and audio are synchronized and analyzing the labels in the original dataset.
  • Figure 5: Audio-language retrieval model and automatic audio captioning model frameworks. Similar to CLIP, the audio-language retrieval model consists of an audio encoder, text encoder, and contrastive loss. The automatic audio captioning model comprises a frozen audio encoder and language model, and a trainable mapping network.
  • ...and 2 more figures
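
The CLIP-style retrieval objective mentioned in the Figure 5 caption (audio encoder, text encoder, contrastive loss) can be sketched as a symmetric cross-entropy over pairwise similarities. This is a minimal pure-Python illustration and not the authors' implementation: the function names, the toy embeddings, and the temperature of 0.07 are assumptions, and the encoders themselves are omitted.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Row i of audio_emb is assumed to be paired with row i of text_emb.
    The temperature value is an assumption, not taken from the paper.
    """
    a = [l2_normalize(v) for v in audio_emb]
    t = [l2_normalize(v) for v in text_emb]
    # logits[i][j] = scaled cosine similarity between audio i and text j
    logits = [[sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
              for ai in a]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-audio direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy 2-D embeddings: correctly paired texts versus swapped texts.
audio = [[1.0, 0.0], [0.0, 1.0]]
matched = contrastive_loss(audio, [[0.9, 0.1], [0.1, 0.9]])
shuffled = contrastive_loss(audio, [[0.1, 0.9], [0.9, 0.1]])
print(matched, shuffled)
```

With correctly paired embeddings the diagonal similarities dominate and the loss is small; swapping the texts moves the high similarities off the diagonal and the loss grows, which is the signal that pulls matching audio and text together during training.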