
Cache Mechanism for Agent RAG Systems

Shuhang Lin, Zhencan Peng, Lingyao Li, Xiao Lin, Xi Zhu, Yongfeng Zhang

TL;DR

ARC tackles the high cost of large external knowledge indexes in retrieval-augmented generation by introducing an annotation-free, per-agent cache. It combines dynamic, query-driven demand, measured by Distance-Rank Frequency (DRF), with a geometry-aware hubness centrality to form a principled cache priority under a fixed cache budget. On a 6.4M-document Wikipedia index and three QA datasets, ARC achieves up to a 79.8% has-answer rate while using only 0.015% of the original storage and reducing average retrieval latency by about 80%, outperforming standard caching baselines. The approach highlights the practical value of embedding-space geometry for data-efficient, latency-sensitive RAG systems in edge and bandwidth-constrained settings, with clear avenues for extension to multi-turn and cross-domain scenarios.

Abstract

Recent advances in Large Language Model (LLM)-based agents have been propelled by Retrieval-Augmented Generation (RAG), which grants the models access to vast external knowledge bases. Despite RAG's success in improving agent performance, agent-level cache management, particularly constructing, maintaining, and updating a compact, relevant corpus dynamically tailored to each agent's needs, remains underexplored. We therefore introduce ARC (Agent RAG Cache Mechanism), a novel, annotation-free caching framework that dynamically manages small, high-value corpora for each agent. By synthesizing historical query distribution patterns with the intrinsic geometry of cached items in the embedding space, ARC automatically maintains a high-relevance cache. In experiments on three retrieval datasets, ARC reduces storage requirements to 0.015% of the original corpus while offering up to a 79.8% has-answer rate and reducing average retrieval latency by 80%. These results indicate that ARC can substantially improve both efficiency and effectiveness in RAG-powered LLM agents.

Paper Structure

This paper contains 15 sections, 10 equations, 2 figures, 2 tables, 1 algorithm.

Figures (2)

  • Figure 1: The motivations of our research: (a) Comparison between traditional and cached retrieval-augmented agents; (b) Our proposed ARCM schema.
  • Figure 2: Cache performance analysis: (a) Effects of varying cache capacity; (b) Continuous improvement of has-answer rate with streaming queries.