HyperRAG: Enhancing Quality-Efficiency Tradeoffs in Retrieval-Augmented Generation with Reranker KV-Cache Reuse
Yuwei An, Yihua Cheng, Seo Jin Park, Junchen Jiang
TL;DR
HyperRAG targets the quality-efficiency dilemma in retrieval-augmented generation by reusing document KV-cache during reranker inference, shifting bottlenecks from GPU compute to storage and bandwidth. The system combines KV-cache compression, a static attention layout, and a CPU-oriented index with LMCache-backed storage to maintain high generation quality while delivering substantial throughput gains. Empirical results show HyperRAG achieves a 2–3× throughput improvement for decoder-only rerankers without sacrificing downstream performance, validating the practicality of KV-cache reuse for large-scale RAG services. The work also discusses deployment strategies and real-world considerations, including storage requirements and an architecture for shared KV-cache backends that enables scalable, cost-effective RAG service. The approach has broad implications for deploying high-quality, high-throughput RAG systems in production settings.
Abstract
Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing the performance of large language models (LLMs) by integrating external knowledge into the generation process. A key component of RAG pipelines is the reranker, which selects the most relevant documents from a pool of retrieved candidates and significantly improves the quality of the generated responses. However, rerankers introduce computational overhead that hinders high throughput and low latency. To address this problem, we propose HyperRAG, a system that optimizes the trade-off between quality and efficiency in RAG pipelines by leveraging KV-cache reuse for efficient reranker inference. By reusing document-side KV-cache, HyperRAG achieves both high-quality generation and system-level efficiency. To fully realize the benefits of KV-cache reuse, HyperRAG incorporates a range of system-level optimizations designed to enhance efficiency and scalability. Experiments show that HyperRAG achieves a 2–3× throughput improvement with decoder-only rerankers while also delivering higher downstream performance than a traditional RAG service.
