LongAudio-RAG: Event-Grounded Question Answering over Multi-Hour Long Audio

Naveen Vakada, Kartik Hegde, Arvind Krishna Sridhar, Yinyi Guo, Erik Visser

TL;DR

The paper addresses the challenge of answering natural-language questions over multi-hour audio with precise temporal grounding and minimal hallucination. It proposes LongAudio-RAG, a hybrid system that grounds LLM outputs in timestamped acoustic events stored in an SQL event log, with an open-vocabulary audio grounding model (AGM) extracting events at the edge and a cloud-hosted LLM handling reasoning. The authors introduce a synthetic long-audio benchmark and report that constrained, event-level evidence improves accuracy over standard RAG and text-to-SQL baselines while supporting low-latency edge-cloud deployment. The approach targets industrial IoT and home audio analytics, performs well with mid-sized LLMs, and outlines a path toward fully on-device reasoning and broader sensing modalities.

Abstract

Long-duration audio is increasingly common in industrial and consumer settings, yet reviewing multi-hour recordings is impractical, motivating systems that answer natural-language queries with precise temporal grounding and minimal hallucination. Existing audio-language models show promise, but long-audio question answering remains difficult due to context-length limits. We introduce LongAudio-RAG (LA-RAG), a hybrid framework that grounds Large Language Model (LLM) outputs in retrieved, timestamped acoustic event detections rather than raw audio. Multi-hour streams are converted into structured event records stored in an SQL database, and at inference time the system resolves natural-language time references, classifies intent, retrieves only the relevant events, and generates answers using this constrained evidence. To evaluate performance, we construct a synthetic long-audio benchmark by concatenating recordings with preserved timestamps and generating template-based question-answer pairs for detection, counting, and summarization tasks. Finally, we demonstrate the practicality of our approach by deploying it in a hybrid edge-cloud environment, where the audio grounding model runs on-device on IoT-class hardware while the LLM is hosted on a GPU-backed server. This architecture enables low-latency event extraction at the edge and high-quality language reasoning in the cloud. Experiments show that structured, event-level retrieval significantly improves accuracy compared to vanilla Retrieval-Augmented Generation (RAG) or text-to-SQL approaches.
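
The retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the table schema, the column names, the sample events, and the helper functions `build_event_log`, `retrieve_events`, and `format_evidence` are all assumptions made for illustration. It assumes the time references in a question have already been resolved to an absolute window, and shows how only the matching timestamped events would be pulled from an SQL log and passed to the LLM as constrained evidence.

```python
import sqlite3

# Hypothetical event records: (label, start_ts, end_ts) with ISO-8601 timestamps.
# Same-format ISO strings sort chronologically, so plain string comparison works.
SAMPLE_EVENTS = [
    ("dog_bark", "2024-01-01T09:05:00", "2024-01-01T09:05:04"),
    ("glass_break", "2024-01-01T10:30:10", "2024-01-01T10:30:11"),
    ("dog_bark", "2024-01-01T10:45:00", "2024-01-01T10:45:03"),
    ("dog_bark", "2024-01-01T13:15:00", "2024-01-01T13:15:02"),
]


def build_event_log(events):
    """Create an in-memory SQLite event log (assumed schema, not the paper's)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events ("
        "label TEXT NOT NULL, start_ts TEXT NOT NULL, end_ts TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", events)
    conn.commit()
    return conn


def retrieve_events(conn, label, window_start, window_end):
    """Return events with the given label that overlap the resolved time window."""
    query = (
        "SELECT label, start_ts, end_ts FROM events "
        "WHERE label = ? AND start_ts < ? AND end_ts > ? "
        "ORDER BY start_ts"
    )
    return conn.execute(query, (label, window_end, window_start)).fetchall()


def format_evidence(rows):
    """Render retrieved rows as the only evidence the LLM prompt would contain."""
    if not rows:
        return "No matching events were detected in the requested window."
    return "\n".join(f"{label}: {start} to {end}" for label, start, end in rows)


conn = build_event_log(SAMPLE_EVENTS)
# "Did the dog bark this morning?" resolved to 09:00-12:00 on the recording day.
morning_barks = retrieve_events(
    conn, "dog_bark", "2024-01-01T09:00:00", "2024-01-01T12:00:00"
)
evidence = format_evidence(morning_barks)
```

In this sketch a counting question reduces to `len(morning_barks)`, and a detection question reduces to whether the list is empty. The explicit "no matching events" message gives the LLM something concrete to cite when nothing was detected.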

Paper Structure

This paper contains 37 sections, 13 figures, and 5 tables.

Figures (13)

  • Figure 1: Chat Example for LongAudio-RAG
  • Figure 2: LongAudio-RAG (LA-RAG): Proposed method for long audio question answering.
  • Figure 3: Time resolution module
  • Figure 4: Prompt used in time resolution module
  • Figure 5: Query Rephrasing Module
  • ...and 8 more figures