AudioSetMix: Enhancing Audio-Language Datasets with LLM-Assisted Augmentations
David Xu
TL;DR
Audio-language learning is hampered by the scarcity of high-quality audio-caption data. This work introduces AudioSetMix, a scalable augmentation pipeline that creates rich, aligned audio-caption pairs by applying controlled audio transformations to AudioSet clips and generating captions with an LLM, using TS-AudioSet for precise labeling. A four-stage process (preprocessing, augmentation, caption generation, postprocessing), combined with hard negative mining, improves modifier understanding and text-to-audio retrieval performance on standard benchmarks. The results indicate that augmenting data with diverse, linguistically richer captions and challenging negatives can advance the state of the art in audio-language tasks, and that the resulting dataset is practical for scalable model training.
Abstract
Multi-modal learning in the audio-language domain has seen significant advancements in recent years. However, audio-language learning is held back by data that is scarcer and of lower quality than the data available for image-language tasks. Existing audio-language datasets are notably smaller, and manual labeling is slow because annotators must listen to an entire audio clip to label it accurately. Our method systematically generates audio-caption pairs by applying audio signal processing operations to labeled audio clips and recording each operation alongside the clips' natural language labels. We then use a large language model with a prompt template to generate descriptions of the augmented clips. This scalable method produces AudioSetMix, a high-quality training dataset for audio-language models. Integrating our dataset improves model performance on benchmarks by providing more diverse and better-aligned examples. Notably, our dataset addresses the absence of modifiers (adjectives and adverbs) in existing datasets. By enabling models to learn these concepts, and by generating hard negative examples during training, we achieve state-of-the-art performance on multiple benchmarks.
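The idea of pairing each signal processing operation with a textual record of it can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the synthetic tones standing in for AudioSet clips, the helper names (tone, change_volume, mix, build_prompt), the 6 dB gain value, and the wording of the prompt template are hypothetical and are not taken from the AudioSetMix pipeline.

```python
import math

def tone(freq_hz, num_samples, sample_rate=16000, amplitude=0.5):
    """Synthesize a mono sine tone; a stand-in for a real labeled audio clip."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(num_samples)]

def _clip(x):
    """Keep a sample inside the valid [-1, 1] range."""
    return max(-1.0, min(1.0, x))

def change_volume(samples, gain_db):
    """Scale amplitude by a gain in decibels (negative values make it quieter)."""
    gain = 10 ** (gain_db / 20)
    return [_clip(s * gain) for s in samples]

def mix(a, b):
    """Overlay two clips sample by sample, zero-padding the shorter one."""
    n = max(len(a), len(b))
    a = a + [0.0] * (n - len(a))
    b = b + [0.0] * (n - len(b))
    return [_clip(x + y) for x, y in zip(a, b)]

# Hypothetical prompt template; the actual template is not given in this section.
PROMPT_TEMPLATE = (
    "Write one natural sentence describing an audio clip that was built "
    "with the following steps:\n{steps}"
)

def build_prompt(operations):
    """Turn the list of recorded operations into a text prompt for an LLM."""
    steps = "\n".join(f"- {op}" for op in operations)
    return PROMPT_TEMPLATE.format(steps=steps)

# Two stand-in clips with assumed labels "dog barking" and "rain".
clip_dog = tone(440, 1600)
clip_rain = tone(220, 3200)

# Apply the operations and keep a textual record of each one.
quiet_dog = change_volume(clip_dog, -6.0)
mixed = mix(quiet_dog, clip_rain)
operations = [
    "the sound 'dog barking' has its volume decreased by 6 dB",
    "it is mixed with the sound 'rain'",
]
prompt = build_prompt(operations)
print(prompt)
```

The prompt string would then be sent to a language model, whose output (for example, a sentence mentioning a quiet dog bark over rain) becomes the caption for the mixed clip. Because each operation is recorded as text, modifiers such as "quiet" can be traced back to a concrete transformation of the signal.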
