Audio Dialogues: Dialogues dataset for audio and music understanding

Arushi Goel, Zhifeng Kong, Rafael Valle, Bryan Catanzaro

TL;DR

Audio Dialogues introduces a large, multi-turn dialogue dataset for general sounds and music to advance audio understanding with LLMs. The authors use a prompting-based GPT-4 pipeline guided by AudioSet-SL and MusicCaps captions to generate 163.8k dialogues and comparison QA, followed by CLAP-based filtration to ensure quality. They demonstrate the value of the dataset by evaluating audio-augmented LLMs (LTU, Qwen-Audio, Audio Flamingo) and show meaningful improvements after fine-tuning, highlighting enhanced dialogue capability and contextual reasoning over audio content. This dataset and pipeline enable more interactive, instruction-following audio models with potential impact on audio retrieval, monitoring, and assistive technologies.

Abstract

Existing datasets for audio understanding primarily focus on single-turn interactions (e.g., audio captioning and audio question answering) for describing audio in natural language, which limits understanding of audio through interactive dialogue. To address this gap, we introduce Audio Dialogues: a multi-turn dialogue dataset containing 163.8k samples for general audio sounds and music. In addition to dialogues, Audio Dialogues also has question-answer pairs for understanding and comparing multiple input audio clips together. Audio Dialogues leverages a prompting-based approach and caption annotations from existing datasets to generate multi-turn dialogues using a Large Language Model (LLM). We evaluate existing audio-augmented large language models on our proposed dataset to demonstrate the complexity and applicability of Audio Dialogues. Our code for generating the dataset will be made publicly available. Detailed prompts and generated dialogues can be found on the demo website https://audiodialogues.github.io/.

Paper Structure (9 sections, 2 figures, 4 tables)

This paper contains 9 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Illustration of our data generation pipeline. Audio Dialogues is generated using GPT-4, which takes text-only inputs and produces the three subsets of our proposed dataset: AudioSet dialogues, Music dialogues, and AudioSet comparisons.
  • Figure 2: LAION-CLAP similarities before filtration for AudioSet Dialogues (left) and Music Dialogues (right).
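
The CLAP-based filtration mentioned in the TL;DR and in Figure 2 can be illustrated with a minimal sketch. It assumes that each generated dialogue already has an audio embedding and a text embedding from a CLAP-style model, and that a sample is kept when the cosine similarity between the two meets a cutoff. The field names (`audio_emb`, `text_emb`, `clap_similarity`), the function names, and the threshold value are hypothetical; the text above does not state the cutoff the authors used or exactly which text is embedded.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def filter_dialogues(samples: List[Dict], threshold: float) -> List[Dict]:
    """Keep samples whose audio/text similarity is at least `threshold`.

    `threshold` is a hypothetical parameter, not a value from the paper.
    """
    kept = []
    for sample in samples:
        score = cosine_similarity(sample["audio_emb"], sample["text_emb"])
        if score >= threshold:
            # Store the score so the similarity distribution can be inspected.
            kept.append({**sample, "clap_similarity": score})
    return kept


# Toy embeddings for illustration only; real CLAP embeddings are much larger.
samples = [
    {"id": "a", "audio_emb": [1.0, 0.0], "text_emb": [1.0, 0.0]},
    {"id": "b", "audio_emb": [1.0, 0.0], "text_emb": [0.0, 1.0]},
]
kept = filter_dialogues(samples, threshold=0.5)
print([s["id"] for s in kept])
```

In this toy run, sample `a` has identical audio and text embeddings and is kept, while sample `b` has orthogonal embeddings and is dropped. Plotting the stored scores before applying a cutoff would give a histogram of the kind Figure 2 describes.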