VoiceAgentRAG: Solving the RAG Latency Bottleneck in Real-Time Voice Agents Using Dual-Agent Architectures

Jielin Qiu; Jianguo Zhang; Zixiang Chen; Liangwei Yang; Ming Zhu; Juntao Tan; Haolin Chen; Wenting Zhao; Rithesh Murthy; Roshan Ram; Akshara Prabhakar; Shelby Heinecke; Caiming Xiong; Silvio Savarese; Huan Wang

TL;DR

VoiceAgentRAG is an open-source dual-agent memory router that decouples retrieval from response generation: a background agent pre-fetches relevant document chunks into a FAISS-backed semantic cache, and a foreground agent answers from that cache instead of querying the vector database.

Abstract

We present VoiceAgentRAG, an open-source dual-agent memory router that decouples retrieval from response generation. A background Slow Thinker agent continuously monitors the conversation stream, predicts likely follow-up topics using an LLM, and pre-fetches relevant document chunks into a FAISS-backed semantic cache. A foreground Fast Talker agent reads only from this sub-millisecond cache, bypassing the vector database entirely on cache hits.
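To make the idea of a semantic cache concrete, here is a minimal sketch of a similarity-threshold lookup. It is not the authors' implementation: a brute-force cosine scan over an in-memory list stands in for the FAISS index, and the class name `SemanticCache`, the `threshold` value of 0.8, and the two-dimensional toy embeddings are all assumptions chosen for illustration.

```python
import math
from typing import List, Optional, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Hypothetical stand-in for a FAISS-backed cache (brute-force scan)."""

    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold  # assumed value, not taken from the paper
        self.entries: List[Tuple[Vector, str]] = []

    def put(self, embedding: Vector, chunk: str) -> None:
        """Store a document chunk under the embedding it was fetched for."""
        self.entries.append((embedding, chunk))

    def get(self, query: Vector) -> Optional[str]:
        """Return the most similar cached chunk, or None on a cache miss."""
        best_score, best_chunk = -1.0, None
        for embedding, chunk in self.entries:
            score = cosine(query, embedding)
            if score > best_score:
                best_score, best_chunk = score, chunk
        return best_chunk if best_score >= self.threshold else None


cache = SemanticCache(threshold=0.8)
cache.put([1.0, 0.0], "pricing chunk")  # toy embedding and chunk

hit = cache.get([0.9, 0.1])   # close to the cached embedding
miss = cache.get([0.0, 1.0])  # orthogonal to everything cached
```

In this sketch `hit` is the cached chunk and `miss` is `None`. A real deployment would replace the linear scan with an approximate nearest-neighbour index, which is the role FAISS plays in the system described above.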

Paper Structure (31 sections, 1 figure, 6 tables)

Introduction
Method
Architecture Overview
Semantic Cache
Slow Thinker: Predictive Prefetching
Fast Talker: Cache-First Response
Conversation Stream
Experimental Setup
Knowledge Base
Vector Store
LLM
Conversation Scenarios
Evaluation Protocol
Results
Overall Performance
...and 16 more sections

Figures (1)

Figure 1: Architecture of VoiceAgentRAG. The Slow Thinker (left) continuously monitors the conversation stream, predicts follow-up topics, retrieves from the vector store, and populates the semantic cache. The Fast Talker (right) checks the cache first (<1 ms), bypassing the vector store on hits. On misses, it falls back to direct retrieval (dashed red) and caches the results for future queries.
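The control flow in the caption (prefetch, cache-first read, fallback on a miss, write-back) can be sketched as follows. This is an illustrative toy under stated assumptions, not the released code: the `Router` class, its method names, and the `toy_store` contents are hypothetical, an exact-match dictionary stands in for the semantic cache, and the Slow Thinker's prefetch runs synchronously here rather than in the background.

```python
from typing import Dict, List, Tuple


class Router:
    """Toy cache-first router; all names here are hypothetical."""

    def __init__(self, store: Dict[str, List[str]]) -> None:
        self.store = store                      # stand-in for the vector store
        self.cache: Dict[str, List[str]] = {}   # stand-in for the semantic cache
        self.store_calls = 0                    # counts slow-path retrievals

    def _retrieve(self, topic: str) -> List[str]:
        """Slow path: query the (simulated) vector store."""
        self.store_calls += 1
        return self.store.get(topic, [])

    def prefetch(self, predicted_topics: List[str]) -> None:
        """Slow Thinker role: fill the cache for predicted follow-up topics."""
        for topic in predicted_topics:
            if topic not in self.cache:
                self.cache[topic] = self._retrieve(topic)

    def context_for(self, topic: str) -> Tuple[List[str], str]:
        """Fast Talker role: read the cache first, fall back on a miss."""
        if topic in self.cache:
            return self.cache[topic], "cache"
        chunks = self._retrieve(topic)
        self.cache[topic] = chunks  # cache the result for future queries
        return chunks, "vector_store"


# Toy data, invented for this example only.
toy_store = {
    "pricing": ["pricing chunk"],
    "refunds": ["refunds chunk"],
}

router = Router(toy_store)
router.prefetch(["pricing"])                 # predicted follow-up topic

first = router.context_for("pricing")        # served from the cache
second = router.context_for("refunds")       # miss, falls back to the store
third = router.context_for("refunds")        # now cached by the fallback
```

Only two store lookups occur in this run (one during prefetch, one on the miss), which is the point of the design: the foreground path touches the vector store only when the prediction failed to cover the query.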