AQUALLM: Audio Question Answering Data Generation Using Large Language Models

Swarup Ranjan Behera; Krishna Mohan Injeti; Jaya Sai Kiran Patibandla; Praveen Kumar Pokala; Balakrishna Reddy Pailla

TL;DR

The paper tackles the scarcity of large-scale, high-quality Audio Question Answering (AQA) data by introducing AQUALLM, an automated, LLM-driven pipeline that converts audio-caption pairs into large AQA datasets. The pipeline consists of four modules: Candidate Answer Extraction, Question Generation, Question-Answer Filtering, and Question Paraphrasing. Together they produce diverse, verified QA triplets; the filtering step uses a token-level F1 verifier with a threshold of $0.55$. The authors present three benchmarks (AQUALLM-AudioCaps, AQUALLM-Clotho, and AQUALLM-MACS). AQA models trained on them (e.g., MWAFM) reach accuracies exceeding $95\%$, significantly outperforming models trained on existing datasets. This work delivers scalable data generation, robust benchmarks, and practical resources to accelerate progress in audio QA and cross-modal understanding.
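The token-level F1 check mentioned in the summary can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `token_f1` and `keep_pair`, the lowercase whitespace tokenization, and the choice to keep a pair when F1 is greater than or equal to the threshold are all assumptions made for illustration. Only the $0.55$ threshold comes from the text.

```python
from collections import Counter

F1_THRESHOLD = 0.55  # threshold stated in the summary


def token_f1(predicted: str, reference: str) -> float:
    """Token-level F1 between two answer strings.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    pred_tokens = predicted.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty answers count as a match; one empty answer is a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def keep_pair(candidate_answer: str, regenerated_answer: str) -> bool:
    """Assumed filtering rule: keep the QA pair if F1 reaches the threshold."""
    return token_f1(regenerated_answer, candidate_answer) >= F1_THRESHOLD
```

For instance, comparing the regenerated answer "dog" with the candidate "a dog" gives precision 1 and recall 0.5, so F1 is about 0.67 and the pair would pass under this assumed rule, while "dog barking loudly" against "a dog" gives F1 of 0.4 and would be dropped.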

Abstract

Audio Question Answering (AQA) constitutes a pivotal task in which machines analyze both audio signals and natural language questions to produce precise natural language answers. The significance of possessing high-quality, diverse, and extensive AQA datasets cannot be overstated when aiming for the precision of an AQA system. While there has been notable focus on developing accurate and efficient AQA models, the creation of high-quality, diverse, and extensive datasets for the specific task at hand has not garnered considerable attention. To address this challenge, this work makes several contributions. We introduce a scalable AQA data generation pipeline, denoted as the AQUALLM framework, which relies on Large Language Models (LLMs). This framework utilizes existing audio-caption annotations and incorporates state-of-the-art LLMs to generate expansive, high-quality AQA datasets. Additionally, we present three extensive and high-quality benchmark datasets for AQA, contributing significantly to the progression of AQA research. AQA models trained on the proposed datasets set superior benchmarks compared to the existing state-of-the-art. Moreover, models trained on our datasets demonstrate enhanced generalizability when compared to models trained using human-annotated AQA data. Code and datasets will be accessible on GitHub (https://github.com/swarupbehera/AQUALLM).

Paper Structure (11 sections, 1 figure, 5 tables)

Introduction
Related Work
AQUALLM Framework
Candidate Answer Extraction Module (CAM)
Question Generation Module (QGM)
Question-Answer Filtering Module (QAFM)
Question Paraphrasing Module (QPM)
Experimental Results
AQA Dataset Creation
AQA Model Training and Comparison
Conclusion and Future Work

Figures (1)

Figure 1: The key phases of the AQUALLM framework. CAM: Candidate Answer Extraction Module; QGM: Question Generation Module (LLM); QAFM: Question-Answer Filtering Module, comprising QAM: Question-Answer Module (LLM) and AVM: Answer Verification Module; QPM: Question Paraphrasing Module (LLM).
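The module order in the caption can be read as a simple data flow. The sketch below is one plausible wiring, not the authors' code: the function `run_pipeline`, its parameter names, and the exact inputs each stage receives are hypothetical, and every stage (including the LLM-backed ones) is passed in as a plain callable so that no particular model or API is implied.

```python
from typing import Callable, Iterable, List, Tuple

Triplet = Tuple[str, str, str]  # (audio_id, question, answer)


def run_pipeline(
    audio_id: str,
    caption: str,
    extract_answers: Callable[[str], Iterable[str]],  # CAM stand-in
    generate_question: Callable[[str, str], str],     # QGM stand-in (LLM)
    answer_question: Callable[[str, str], str],       # QAM stand-in (LLM)
    verify: Callable[[str, str], bool],               # AVM stand-in
    paraphrase: Callable[[str], Iterable[str]],       # QPM stand-in (LLM)
) -> List[Triplet]:
    """Turn one audio-caption pair into verified QA triplets (assumed flow)."""
    triplets: List[Triplet] = []
    for answer in extract_answers(caption):
        question = generate_question(caption, answer)
        # QAFM: answer the generated question from the caption, then check
        # that the result agrees with the candidate answer.
        regenerated = answer_question(caption, question)
        if not verify(answer, regenerated):
            continue
        triplets.append((audio_id, question, answer))
        # QPM: each paraphrase of a verified question keeps the same answer.
        for variant in paraphrase(question):
            triplets.append((audio_id, variant, answer))
    return triplets
```

Passing the stages as callables keeps the sketch testable with simple stubs; in practice each stand-in would wrap whatever extractor or LLM prompt a given setup uses.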
