VoiceAgentRAG: Solving the RAG Latency Bottleneck in Real-Time Voice Agents Using Dual-Agent Architectures

Jielin Qiu, Jianguo Zhang, Zixiang Chen, Liangwei Yang, Ming Zhu, Juntao Tan, Haolin Chen, Wenting Zhao, Rithesh Murthy, Roshan Ram, Akshara Prabhakar, Shelby Heinecke, Caiming Xiong, Silvio Savarese, Huan Wang

TL;DR

VoiceAgentRAG is an open-source dual-agent memory router that decouples retrieval from response generation by pre-fetching relevant document chunks into a FAISS-backed semantic cache.

Abstract

We present VoiceAgentRAG, an open-source dual-agent memory router that decouples retrieval from response generation. A background Slow Thinker agent continuously monitors the conversation stream, predicts likely follow-up topics using an LLM, and pre-fetches relevant document chunks into a FAISS-backed semantic cache. A foreground Fast Talker agent reads only from this sub-millisecond cache, bypassing the vector database entirely on cache hits.
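
The background pre-fetch loop described above can be illustrated with a minimal sketch. This is not the paper's implementation: the bag-of-words `embed` function and the linear scan stand in for a real embedding model and a FAISS index, and `SemanticCache`, `slow_thinker_step`, `predict_topics`, `retrieve`, and the `threshold` value are hypothetical names and parameters chosen for illustration.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field


def embed(text: str) -> dict[str, float]:
    """Toy unit-length bag-of-words vector (assumption: stands in for a real embedding model)."""
    counts: dict[str, float] = {}
    for token in text.lower().split():
        counts[token] = counts.get(token, 0.0) + 1.0
    norm = math.sqrt(sum(v * v for v in counts.values())) or 1.0
    return {t: v / norm for t, v in counts.items()}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


@dataclass
class SemanticCache:
    """In-memory cache keyed by embedding similarity (linear scan instead of FAISS)."""
    threshold: float = 0.5  # hypothetical similarity cutoff for a cache hit
    entries: list = field(default_factory=list)

    def put(self, key_text: str, chunks: list[str]) -> None:
        self.entries.append((embed(key_text), chunks))

    def get(self, query: str):
        q = embed(query)
        best_score, best_chunks = 0.0, None
        for key, chunks in self.entries:
            score = cosine(q, key)
            if score > best_score:
                best_score, best_chunks = score, chunks
        return best_chunks if best_score >= self.threshold else None


def slow_thinker_step(history, predict_topics, retrieve, cache: SemanticCache) -> None:
    """One background pass: predict follow-up topics and pre-fetch their chunks."""
    for topic in predict_topics(history):    # in the paper, an LLM makes this prediction
        cache.put(topic, retrieve(topic))    # retrieve() would query the vector database


# Usage with stubbed prediction and retrieval functions.
cache = SemanticCache(threshold=0.5)
slow_thinker_step(
    ["How much is the pro plan?"],
    predict_topics=lambda history: ["pro plan pricing discount"],
    retrieve=lambda topic: [f"chunk about {topic}"],
    cache=cache,
)
print(cache.get("pro plan discount"))  # served from the cache, no vector database call
```

In a real deployment the slow step would run concurrently with the conversation, for example as a background task, so that the foreground agent never waits on it.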

Paper Structure

This paper contains 31 sections, 1 figure, and 6 tables.

Figures (1)

  • Figure 1: Architecture of VoiceAgentRAG. The Slow Thinker (left) continuously monitors the conversation stream, predicts follow-up topics, retrieves from the vector store, and populates the semantic cache. The Fast Talker (right) checks the cache first (<1 ms), bypassing the vector store on hits. On misses, it falls back to direct retrieval (dashed red) and caches the results for future queries.
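
The Fast Talker path in Figure 1 (cache first, fall back to direct retrieval on a miss, then cache the result) might look like the sketch below. It is an illustration only: `fast_talker_lookup` and `retrieve` are hypothetical names, and a plain dictionary with normalized exact-match keys stands in for the semantic cache.

```python
from typing import Callable, Dict, List, Tuple


def fast_talker_lookup(
    query: str,
    cache: Dict[str, List[str]],
    retrieve: Callable[[str], List[str]],
) -> Tuple[List[str], str]:
    """Return (chunks, source), where source is "cache" or "vector_store"."""
    # Assumption: normalized exact matching stands in for semantic similarity search.
    key = " ".join(query.lower().split())
    if key in cache:
        return cache[key], "cache"       # hit: the vector store is bypassed
    chunks = retrieve(query)             # miss: fall back to direct retrieval
    cache[key] = chunks                  # cache the result for future queries
    return chunks, "vector_store"


# Usage: the second, equivalent query is answered without calling retrieve().
calls: List[str] = []


def retrieve(query: str) -> List[str]:
    calls.append(query)
    return ["refund policy chunk"]


cache: Dict[str, List[str]] = {}
print(fast_talker_lookup("Refund policy?", cache, retrieve))
print(fast_talker_lookup("refund   policy?", cache, retrieve))
print(len(calls))
```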