Audio-Language Datasets of Scenes and Events: A Survey
Gijs Wijngaard, Elia Formisano, Michele Esposito, Michel Dumontier
TL;DR
This survey catalogs 69 audio-language datasets used to train audio-language models (ALMs), covering research up to September 2024. It details dataset origins, audio and text characteristics, and usage across tasks such as AAC, ATR, AQA, AG, and LASS. It combines embedding-based analyses (PCA on CLAP embeddings) with data-leakage assessments to reveal cross-dataset overlaps and biases, highlighting the dominance of indoor and human sounds and of English-language content. The work also discusses data quality challenges, the role of LLMs in caption generation, and the pitfalls of relying on large, noisy datasets, and it offers concrete recommendations for dataset accessibility, curation, and robust evaluation. Overall, it provides a map of current resources and practical guidance for constructing diverse, high-quality audio-language corpora that improve generalization and fair evaluation of ALMs.
Abstract
Audio-language models (ALMs) generate linguistic descriptions of sound-producing events and scenes. Advances in dataset creation and computational power have led to significant progress in this domain. This paper surveys 69 datasets used to train ALMs, covering research up to September 2024 (https://github.com/GLJS/audio-datasets). It provides a comprehensive analysis of dataset origins, audio and linguistic characteristics, and use cases. Key sources include YouTube-based datasets such as AudioSet, with over two million samples, and community platforms such as Freesound, with over one million samples. Through principal component analysis of audio and text embeddings, the survey evaluates the acoustic and linguistic variability across datasets. It also analyzes data leakage through CLAP embeddings and examines sound category distributions to identify imbalances. Finally, the survey identifies key challenges in developing large, diverse datasets to enhance ALM performance, including dataset overlap, biases, accessibility barriers, and the predominance of English-language content, while highlighting opportunities for improvement.
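As a rough illustration of the two embedding-based analyses mentioned above, the following minimal sketch projects embeddings onto their first two principal components and flags near-duplicate items across two datasets by cosine similarity. It is not the survey's actual pipeline: random vectors stand in for CLAP embeddings, and the function names `pca_project` and `find_overlaps`, the embedding dimension of 512, and the similarity threshold of 0.99 are all assumptions chosen for illustration.

```python
import numpy as np


def pca_project(embeddings, n_components=2):
    """Project row-vector embeddings onto their top principal components."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # SVD of the centered data: the rows of vt are the principal directions,
    # ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def find_overlaps(emb_a, emb_b, threshold=0.99):
    """Return (i, j) pairs where emb_a[i] and emb_b[j] have cosine
    similarity at or above the (hypothetical) threshold."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sims = a @ b.T  # pairwise cosine similarities
    rows, cols = np.nonzero(sims >= threshold)
    return list(zip(rows.tolist(), cols.tolist()))


# Placeholder data: random vectors stand in for real CLAP embeddings.
rng = np.random.default_rng(0)
train = rng.normal(size=(100, 512))  # e.g. a training dataset
test = rng.normal(size=(20, 512))    # e.g. an evaluation dataset
test[3] = train[42]                  # plant one leaked item

coords = pca_project(np.vstack([train, test]))  # 2-D coordinates for plotting
overlaps = find_overlaps(test, train)           # candidate leaked pairs
print(coords.shape, overlaps)
```

In practice the threshold would need tuning against known duplicates, since re-encoded or trimmed copies of the same clip will not match exactly.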
