HyperRAG: Enhancing Quality-Efficiency Tradeoffs in Retrieval-Augmented Generation with Reranker KV-Cache Reuse

Yuwei An, Yihua Cheng, Seo Jin Park, Junchen Jiang

TL;DR

HyperRAG targets the quality-efficiency dilemma in retrieval-augmented generation by reusing document KV-cache during reranker inference, shifting bottlenecks from GPU compute to storage and bandwidth. The system combines KV-cache compression, a static attention layout, and a CPU-oriented index with LMCache-backed storage to maintain high generation quality while delivering substantial throughput gains. Empirical results show HyperRAG achieves a 2–3x throughput improvement for decoder-based rerankers without sacrificing downstream performance, validating the practicality of KV-cache reuse for large-scale RAG services. The work also discusses deployment strategies and real-world considerations, including storage requirements and architecture for shared KV-cache backends that enable scalable, cost-effective RAG service. The approach has broad implications for deploying high-quality, high-throughput RAG systems in production settings.

Abstract

Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing the performance of large language models (LLMs) by integrating external knowledge into the generation process. A key component of RAG pipelines is the reranker, which selects the most relevant documents from a pool of retrieved candidates and significantly improves the quality of the generated responses. However, rerankers also introduce computational costs that hinder high throughput and low latency. To address this problem, we propose HyperRAG, a system that optimizes the trade-off between quality and efficiency in RAG pipelines by leveraging KV-cache reuse for efficient reranker inference. By reusing document-side KV-cache, HyperRAG achieves both high-quality generation and system-level efficiency. To fully realize the benefits of KV-cache reuse, HyperRAG incorporates a range of system-level optimizations designed to enhance efficiency and scalability. Experiments show that HyperRAG achieves a 2–3x throughput improvement with decoder-only rerankers while also delivering higher downstream performance compared with a traditional RAG service.
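To make the idea of document-side KV-cache reuse concrete, here is a minimal sketch, not the authors' implementation. It assumes a toy two-layer, single-head causal attention stack in NumPy with random weights; the names `causal_layer` and `forward` are hypothetical. The sizes 256 (document) and 48 (query) mirror the chunk sizes quoted in the Figure 3 caption. Because of the causal mask, the keys and values of the document tokens never depend on the query, so they can be computed once offline and reused for every query.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def causal_layer(x, w, past_k=None, past_v=None):
    """One causal self-attention layer with a residual connection (toy model).

    x: (n_new, d) states of the new tokens.
    past_k, past_v: cached (n_past, d) keys/values of earlier tokens, or None.
    Returns the outputs for the new tokens plus the full K and V.
    """
    wq, wk, wv = w
    q, k, v = x @ wq, x @ wk, x @ wv
    if past_k is not None:
        k = np.concatenate([past_k, k])
        v = np.concatenate([past_v, v])
    n_new, n_all = q.shape[0], k.shape[0]
    n_past = n_all - n_new
    # New token i may attend to every past token and to new tokens 0..i.
    masked = np.arange(n_all)[None, :] > (n_past + np.arange(n_new))[:, None]
    scores = np.where(masked, -np.inf, q @ k.T / np.sqrt(q.shape[1]))
    return x + softmax(scores) @ v, k, v

def forward(x, layers, cache=None):
    """Run all layers; `cache` holds one (K, V) pair per layer for the prefix."""
    new_cache = []
    for i, w in enumerate(layers):
        past_k, past_v = cache[i] if cache else (None, None)
        x, k, v = causal_layer(x, w, past_k, past_v)
        new_cache.append((k, v))
    return x, new_cache

rng = np.random.default_rng(0)
d, n_doc, n_query = 16, 256, 48
layers = [tuple(rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
          for _ in range(2)]
doc = rng.normal(size=(n_doc, d))      # stand-in for document token embeddings
query = rng.normal(size=(n_query, d))  # stand-in for query token embeddings

# Offline: compute the document KV once and store it.
_, doc_cache = forward(doc, layers)

# Online with reuse: only the 48 query tokens are processed.
reused, _ = forward(query, layers, cache=doc_cache)

# Online without reuse: all 304 tokens are recomputed.
full, _ = forward(np.concatenate([doc, query]), layers)

max_diff = float(np.abs(full[n_doc:] - reused).max())
print("max difference on query positions:", max_diff)
```

In this toy setting the two paths agree on the query positions up to floating-point rounding, while the reuse path skips all document-side computation. That saving is what moves the bottleneck from GPU compute toward loading stored KV from storage.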

Paper Structure

This paper contains 22 sections, 6 figures, 1 table.

Figures (6)

  • Figure 1: Classic RAG Workflow: The query is embedded and used to retrieve the top-K documents. The reranker then selects the most relevant ones, which are combined with the query to generate the final response.
  • Figure 2: RAG downstream performance with different rerankers: Subfigures a, b, and c show exact match (EM) scores on TriviaQA (Joshi et al., 2017), NaturalQA, and PopQA with various rerankers. The x-axis denotes the number of retrieved documents passed to the reranker, from which the top-1 document is selected for generation. D-BOUND marks the performance upper bound imposed by the number of documents considered during reranking, while M-BOUND marks the upper bound imposed by the reranker's ability to identify the most relevant document. The generation model is meta-llama/Llama-3.1-8B-Instruct. The five labels represent different configurations: Baseline (no RAG), Embedding-only (retrieves the top document directly using cosine similarity), E/MINILM (uses the encoder-only ms-marco-MiniLM-L6-v2 reranker), E/BGEM (uses the encoder-only bge-reranker-v2-m3 reranker), and D/GEMMA (uses the decoder-only Gemma 2B reranker, bge-reranker-v2-gemma).
  • Figure 3: Efficiency Observations for the Reranker: Subfigure (a) illustrates the trade-off between latency and NaturalQA performance across different reranker models. Subfigure (b) presents the latency and throughput of the Gemma-2B reranker under varying document chunk sizes, with the query chunk size fixed at 48. The blue line indicates full computation, while the orange line represents computation with KV-cache reuse. Solid lines denote latency, and dashed lines denote throughput. Subfigure (c) shows the memory footprint of the Gemma-2B reranker during inference with different batch sizes, using a fixed input length of $256 + 48 = 304$. Subfigure (d) highlights how throughput increases with larger batch sizes up until an out-of-memory (OOM) error occurs for Gemma-2B reranker model inference.
  • Figure 4: Overview of HyperRAG
  • Figure 5: Static KV Layout: During reranking, we allocate a fixed-length KV buffer for attention. The buffer consists of a static document segment (retrieved KV, shown in red) and a static query segment (computed KV, shown in green).
  • ...and 1 more figure
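The static KV layout described in the Figure 5 caption can be pictured as a preallocated buffer with two fixed segments. The following is a rough sketch under stated assumptions, not the paper's implementation: the class `StaticKVBuffer`, its methods, the single-head shape, and the `valid` padding mask are all hypothetical, and the segment lengths 256 and 48 are borrowed from the Figure 3 caption. How HyperRAG actually handles inputs shorter than a segment is not stated here, so the mask is only one plausible choice.

```python
import numpy as np

# Segment sizes taken from the Figure 3 caption (256 + 48 = 304 slots).
DOC_LEN, QUERY_LEN, HEAD_DIM = 256, 48, 8

class StaticKVBuffer:
    """Fixed-length KV buffer: [document segment | query segment] per batch row."""

    def __init__(self, batch_size):
        total = DOC_LEN + QUERY_LEN
        self.k = np.zeros((batch_size, total, HEAD_DIM), dtype=np.float32)
        self.v = np.zeros_like(self.k)
        # Assumption: a boolean mask marks which slots hold real tokens.
        self.valid = np.zeros((batch_size, total), dtype=bool)

    def _write(self, row, start, seg_len, k, v):
        n = k.shape[0]
        if n > seg_len:
            raise ValueError("input longer than its static segment")
        self.k[row, start:start + n] = k
        self.v[row, start:start + n] = v
        self.valid[row, start:start + seg_len] = False
        self.valid[row, start:start + n] = True

    def load_document(self, row, k, v):
        """Copy retrieved (precomputed) document KV into the document segment."""
        self._write(row, 0, DOC_LEN, k, v)

    def write_query(self, row, k, v):
        """Write freshly computed query KV into the query segment."""
        self._write(row, DOC_LEN, QUERY_LEN, k, v)

rng = np.random.default_rng(0)
buf = StaticKVBuffer(batch_size=2)
doc_k = rng.normal(size=(200, HEAD_DIM)).astype(np.float32)  # retrieved KV
doc_v = rng.normal(size=(200, HEAD_DIM)).astype(np.float32)
q_k = rng.normal(size=(48, HEAD_DIM)).astype(np.float32)     # computed KV
q_v = rng.normal(size=(48, HEAD_DIM)).astype(np.float32)

buf.load_document(0, doc_k, doc_v)
buf.write_query(0, q_k, q_v)
print(buf.k.shape, int(buf.valid[0].sum()))
```

Because the query segment always starts at the same offset, the buffer shape never changes between requests. A fixed shape of this kind is what lets the buffer be allocated once and reused across reranking calls.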