AudioRAG: A Challenging Benchmark for Audio Reasoning and Information Retrieval

Jingru Lin, Chen Zhang, Tianrui Wang, Haizhou Li

TL;DR

AudioRAG is a challenging benchmark for evaluating multi-hop audio reasoning with external information retrieval in realistic web environments. It pairs audio-context questions with retrieval-augmented answers and includes both automatically generated and human-curated data, totaling 500 samples. The study shows that current Large Audio-Language Models (LALMs) struggle on this task, while an agentic pipeline that integrates audio processing and web retrieval yields notable performance gains (up to ~24.9% relative). The work highlights the importance of grounded reasoning for audio tasks and offers a practical baseline and dataset for advancing retrieval-augmented audio reasoning research.

Abstract

Recent advancements in Large Audio-Language Models (LALMs), which demonstrate remarkable performance across a range of sound-, speech- and music-related tasks, have driven growing interest in benchmarks to assess these models. Existing benchmarks generally focus only on reasoning with internal knowledge, neglecting real-world scenarios that require external information grounding. To bridge this gap, we introduce AudioRAG, a novel benchmark designed to evaluate audio-based reasoning augmented by information retrieval in realistic web environments. The benchmark comprises both LLM-generated and manually curated question-answer pairs. Our evaluations reveal that even state-of-the-art LALMs struggle to answer these questions. We therefore propose an agentic pipeline that integrates audio reasoning with retrieval-augmented generation, providing a stronger baseline for future research.

Paper Structure (17 sections, 1 equation, 2 figures, 5 tables)

Figures (2)

  • Figure 1: Data construction process (left) and the agentic pipeline (right).
  • Figure 2: Breakdown of error types for incorrect answers. The x-axis shows the error category: A is Reasoning Error, B is Audio Processing Error, C is Knowledge Error, and D is Invalid Answer.