Audio-Language Datasets of Scenes and Events: A Survey

Gijs Wijngaard, Elia Formisano, Michele Esposito, Michel Dumontier

TL;DR

This survey systematically catalogs 69 audio-language datasets used to train ALMs up to September 2024, detailing dataset origins, audio/text characteristics, and usage across tasks such as AAC, ATR, AQA, AG, and LASS. It combines embedding-based analyses (PCA on CLAP) with data-leakage assessments to reveal cross-dataset overlaps and biases, highlighting substantial indoor/human-sound dominance and English-language predominance. The work also discusses data quality challenges, the role of LLMs for caption generation, and the pitfalls of relying on large, noisy datasets, offering concrete recommendations for dataset accessibility, curation, and robust evaluation. Overall, it provides a comprehensive map of current resources and pragmatic guidance for constructing diverse, high-quality audio-language corpora that improve generalization and fair evaluation in ALMs.

Abstract

Audio-language models (ALMs) generate linguistic descriptions of sound-producing events and scenes. Advances in dataset creation and computational power have led to significant progress in this domain. This paper surveys 69 datasets used to train ALMs, covering research up to September 2024 (https://github.com/GLJS/audio-datasets). It provides a comprehensive analysis of dataset origins, audio and linguistic characteristics, and use cases. Key sources include YouTube-based datasets like AudioSet with over two million samples, and community platforms like Freesound with over one million samples. Through principal component analysis of audio and text embeddings, the survey evaluates the acoustic and linguistic variability across datasets. It also analyzes data leakage through CLAP embeddings, and examines sound category distributions to identify imbalances. Finally, the survey identifies key challenges in developing large, diverse datasets to enhance ALM performance, including dataset overlap, biases, accessibility barriers, and the predominance of English-language content, while highlighting opportunities for improvement.

Paper Structure

This paper contains 72 sections, 1 equation, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Audio-Language models: Audio-to-Language models (top row) process audio input to generate text output, while Text-to-Audio models (bottom row) process text input to manipulate or generate audio output.
  • Figure 2: Datasets and the task in the field of Audio-Language Learning for which each dataset was originally proposed. The color indicates whether a dataset overlaps with another dataset, based on their origin. Datasets whose background consists of multiple colors overlap with multiple origin datasets. Derivatives of Clotho are placed under Freesound, and derivatives of AudioCaps under YouTube. Datasets that do not share an origin with other datasets are marked with a white background. † = datasets for text-to-audio retrieval are synonymous with datasets for audio-to-text retrieval.
  • Figure 3: Distribution of audio samples across AudioSet's top-200 categories for each dataset stacked vertically, shown on a logarithmic scale. Categories are grouped by the root-level categories of the AudioSet ontology (Natural sounds, Human sounds, etc.), sorted by size and color-coded, with the total number of samples (n) shown below each group. Datasets are sorted by size in the legend.
  • Figure 4: PCA decomposition of the audio and text embeddings of all datasets. The stars in the visualisations represent the centroid of each dataset.
  • Figure 5: Heatmap showing the centroid distances and pairwise similarities between audio and text CLAP embeddings of all datasets. The color intensity represents the centroid distance, while the numerical annotations show the percentage of pairs exceeding a 50% similarity threshold between datasets. The lower (left) triangle represents audio similarities, and the upper (right) triangle represents text similarities.
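
The quantities named in the captions of Figures 4 and 5 (PCA projections with per-dataset centroids, centroid distances, and the share of cross-dataset pairs above a similarity threshold) can be illustrated with a minimal sketch. This is not the authors' code: the function names are hypothetical, random vectors stand in for real CLAP embeddings, and the use of cosine similarity, Euclidean centroid distance, and a fraction taken over all cross-dataset pairs are assumptions, since the page does not specify them.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def centroid_distance(a, b):
    """Euclidean distance between the mean embeddings of two datasets (assumed metric)."""
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))


def fraction_similar_pairs(a, b, threshold=0.5):
    """Share of all cross-dataset pairs whose cosine similarity exceeds `threshold`."""
    sims = l2_normalize(a) @ l2_normalize(b).T  # shape: (len(a), len(b))
    return float((sims > threshold).mean())


def pca_2d(x):
    """Project rows onto the first two principal components using an SVD."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Placeholder data: random vectors stand in for CLAP embeddings (assumption).
rng = np.random.default_rng(0)
dataset_a = rng.normal(size=(200, 16))
# A hypothetical derived dataset: near-duplicates of the first 50 items of dataset_a.
dataset_b = dataset_a[:50] + 0.01 * rng.normal(size=(50, 16))

# Fit one PCA on the pooled embeddings, then take each dataset's centroid in that space.
projected = pca_2d(np.vstack([dataset_a, dataset_b]))
centroid_a = projected[:200].mean(axis=0)
centroid_b = projected[200:].mean(axis=0)

print("centroid distance:", centroid_distance(dataset_a, dataset_b))
print("share of pairs above 0.5:", fraction_similar_pairs(dataset_a, dataset_b))
print("2-D centroids:", centroid_a, centroid_b)
```

In this toy setup, the near-duplicate items in `dataset_b` are what push some cross-dataset pairs above the threshold; with real embeddings, a high share of such pairs between a training set and an evaluation set would be one signal of possible leakage worth inspecting.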