WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models

Yifu Chen, Shengpeng Ji, Haoxiao Wang, Ziqing Wang, Siyu Chen, Jinzheng He, Jin Xu, Zhou Zhao

TL;DR

WavRAG introduces an end-to-end, audio-native retrieval augmented generation framework that directly processes raw audio for embedding and retrieval, addressing the limitations of ASR-based pipelines. It builds a unified text–audio knowledge base and deploys a contrastively trained WavRetriever, grounded in Qwen2-Audio, to produce multimodal embeddings, enabling efficient top-k retrieval via cosine similarity. The generation stage incorporates Zero-Shot-CoT reasoning and a Self-Consistency mechanism to improve reliability and grounding in multimodal knowledge. Across multiple retrieval and generation benchmarks, WavRAG achieves competitive retrieval performance with 5–14x speedups and notable gains from CoT, while extending RAG capabilities to the audio modality and outperforming text-only baselines on multimodal tasks. Human evaluations confirm high-quality knowledge extension, though the authors note open questions about leveraging acoustic aspects like prosody and emotion in future work.
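The top-k retrieval step mentioned above can be illustrated with a minimal sketch. It assumes the embeddings have already been produced by some encoder; the toy 3-dimensional vectors, the document IDs, and the function names `cosine_similarity` and `retrieve_top_k` are hypothetical and are not taken from the paper or from WavRetriever.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    query: Sequence[float],
    knowledge_base: List[Tuple[str, Sequence[float]]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k (doc_id, score) pairs with the highest cosine similarity."""
    scored = [(doc_id, cosine_similarity(query, emb)) for doc_id, emb in knowledge_base]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy knowledge base mixing text and audio entries in one embedding space.
# The vectors are made up; a real system would use encoder outputs.
knowledge_base = [
    ("text:doc_a", [1.0, 0.0, 0.0]),
    ("audio:clip_b", [0.0, 1.0, 0.0]),
    ("audio:clip_c", [0.9, 0.1, 0.0]),
]
query_embedding = [1.0, 0.0, 0.0]

top = retrieve_top_k(query_embedding, knowledge_base, k=2)
for doc_id, score in top:
    print(f"{doc_id}: {score:.4f}")
```

Because text and audio entries share one embedding space in this sketch, a single similarity ranking covers both modalities. A production system would typically replace the linear scan with a vector index.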

Abstract

Retrieval Augmented Generation (RAG) has gained widespread adoption owing to its capacity to empower large language models (LLMs) to integrate external knowledge. However, existing RAG frameworks are primarily designed for text-based LLMs and rely on Automatic Speech Recognition to process speech input, which discards crucial audio information, risks transcription errors, and increases computational overhead. Therefore, we introduce WavRAG, the first retrieval augmented generation framework with native, end-to-end audio support. WavRAG offers two key features: 1) Bypassing ASR, WavRAG directly processes raw audio for both embedding and retrieval. 2) WavRAG integrates audio and text into a unified knowledge representation. Specifically, we propose the WavRetriever to facilitate the retrieval from a text-audio hybrid knowledge base, and further enhance the in-context capabilities of spoken dialogue models through the integration of chain-of-thought reasoning. In comparison to state-of-the-art ASR-Text RAG pipelines, WavRAG achieves comparable retrieval performance while delivering a 10x acceleration. Furthermore, WavRAG's unique text-audio hybrid retrieval capability extends the boundaries of RAG to the audio modality.
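The chain-of-thought integration mentioned in the abstract, combined with the Self-Consistency mechanism named in the TL;DR, can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the prompt layout, the helper names `build_cot_prompt` and `self_consistent_answer`, and the canned sampler are hypothetical, and retrieved audio is stood in for by text placeholders. The trigger phrase is the standard Zero-Shot-CoT prompt; whether WavRAG uses this exact wording is an assumption.

```python
from collections import Counter
from typing import Callable, List

# Standard Zero-Shot-CoT trigger phrase; its use here is an assumption.
COT_TRIGGER = "Let's think step by step."


def build_cot_prompt(question: str, retrieved: List[str]) -> str:
    """Combine retrieved knowledge and the question into one reasoning prompt."""
    context = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(retrieved))
    return f"Retrieved knowledge:\n{context}\n\nQuestion: {question}\n{COT_TRIGGER}"


def self_consistent_answer(
    sample_answer: Callable[[str], str], prompt: str, n_samples: int = 5
) -> str:
    """Sample several answers and return the most frequent one (majority vote)."""
    answers = [sample_answer(prompt).strip().lower() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]


# Hypothetical stand-in for a stochastic model call: replays canned answers.
_canned = iter(["Paris", "paris", "Lyon", "Paris ", "Lyon"])


def fake_sampler(prompt: str) -> str:
    return next(_canned)


prompt = build_cot_prompt(
    "What is the capital of France?",
    ["France's capital is Paris."],
)
final = self_consistent_answer(fake_sampler, prompt, n_samples=5)
print(final)
```

Majority voting over normalized answers is the simplest form of self-consistency. It only works when final answers can be compared directly, so free-form responses would need an answer-extraction step first.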

Paper Structure

This paper contains 35 sections, 5 equations, 6 figures, 10 tables.

Figures (6)

  • Figure 1: Dialogue examples illustrating WavRAG's ability to understand queries and generate appropriate responses by retrieving relevant knowledge from diverse modalities and augmenting generation with it.
  • Figure 2: Architecture of the WavRAG framework. Top: Traditional RAG pipeline using ASR, highlighting its limitations. Bottom: WavRAG's four-step process: (1) A dual-modality encoder creates embeddings for both audio and text queries; (2) Top-K documents are retrieved from an audio-text knowledge base using cosine similarity; (3) A chain-of-thought reasoning process analyzes the retrieved information; (4) A large language model generates the final response, grounded in the retrieved knowledge.
  • Figure 3: Architecture of the proposed multimodal retriever, showing the input processing, LLM-based encoding, and knowledge base structure.
  • Figure 4: Human evaluation of knowledge quality. Distributions are shown for Grammatical scores, Factual scores, Relevance scores, and Helpfulness scores. The Helpfulness plot is further broken down by helpfulness level (helpful, neutral, harmful).
  • Figure 5: Prompt for extension
  • ...and 1 more figure