Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation

Shubham Agarwal, Sai Sundaresan, Subrata Mitra, Debabrata Mahapatra, Archit Gupta, Rounak Sharma, Nirmal Joshua Kapu, Tong Yu, Shiv Saini

TL;DR

Cache-Craft tackles the prefill bottleneck in Retrieval-Augmented Generation by reusing precomputed KV-caches for knowledge chunks and selectively recomputing a small fraction of tokens to preserve generation quality. It introduces attention-based reuse metrics (Inter/Intra attention, Prefix Overlap Beta, Order Penalty Gamma, and Cache Context Impact) to determine which chunk-caches are reusable and how much recomputation is needed, plus a layered, memory-hierarchical caching pipeline with layer-wise preloading. The approach reduces redundant GPU computation by 51% versus prefix caching and by 75% versus full recomputation. Under continuous batching on a production workload, it delivers a 1.6x throughput speedup and a 2x reduction in end-to-end response latency over prefix caching, while maintaining high ROUGE F1 scores and favorable user-study results. These results indicate that chunk-level KV-cache management is practical for production RAG workloads on both LLaMA-3-8B and LLaMA-3-70B, giving faster and cheaper generation without sacrificing quality.

Abstract

Retrieval-Augmented Generation (RAG) is often used with Large Language Models (LLMs) to infuse domain knowledge or user-specific information. In RAG, given a user query, a retriever extracts chunks of relevant text from a knowledge base, and these chunks are sent to the LLM as part of the input prompt. Typically, any given chunk is retrieved repeatedly across user questions. Currently, however, the attention layers of the LLM fully recompute the keys and values (KVs) for the input chunks on every question, because state-of-the-art methods cannot reuse KV-caches when chunks appear at arbitrary locations with arbitrary contexts, and naive reuse degrades output quality. This recomputation is potentially redundant work on expensive GPUs and increases latency. In this work, we propose Cache-Craft, a system for managing and reusing the precomputed KVs of text chunks (which we call chunk-caches) in RAG-based systems. We present how to identify chunk-caches that are reusable, how to efficiently recompute a small fraction of the cache to maintain output quality, and how to efficiently store and evict chunk-caches in hardware to maximize reuse while masking any overheads. With real production workloads as well as synthetic datasets, we show that Cache-Craft reduces redundant computation by 51% over state-of-the-art (SOTA) prefix-caching and by 75% over full recomputation. Additionally, with continuous batching on a real production workload, we get a 1.6X speedup in throughput and a 2X reduction in end-to-end response latency over prefix-caching while maintaining quality, for both the LLaMA-3-8B and LLaMA-3-70B models.
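
To make the contrast with prefix caching concrete, the toy sketch below counts how many prompt tokens could reuse cached KVs when the same chunks come back in a different order. It is an illustration under stated assumptions, not the Cache-Craft implementation: `ChunkCacheStore`, `prefix_hit_tokens`, and `chunk_hit_tokens` are hypothetical names, whitespace-separated words stand in for tokens, a string stands in for the KV tensors, LRU is only a stand-in eviction policy, and the partial recomputation the abstract describes for preserving quality is left out entirely.

```python
from collections import OrderedDict


class ChunkCacheStore:
    """Toy LRU store mapping chunk text to a placeholder for its precomputed KVs."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, chunk):
        if chunk not in self._entries:
            return None
        self._entries.move_to_end(chunk)  # mark as most recently used
        return self._entries[chunk]

    def put(self, chunk, kv):
        self._entries[chunk] = kv
        self._entries.move_to_end(chunk)
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used


def prefix_hit_tokens(prev_chunks, new_chunks):
    """Tokens reusable under prefix caching: only the shared leading chunks."""
    hits = 0
    for old, new in zip(prev_chunks, new_chunks):
        if old != new:
            break
        hits += len(new.split())
    return hits


def chunk_hit_tokens(store, new_chunks):
    """Tokens reusable if each chunk's KVs are looked up independently of position."""
    return sum(len(c.split()) for c in new_chunks if store.get(c) is not None)


# First request: three retrieved chunks, all cached afterwards.
store = ChunkCacheStore(capacity=8)
first = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
for chunk in first:
    store.put(chunk, kv=f"<KV for {len(chunk.split())} tokens>")

# Second request: two of the same chunks, reordered, plus one new chunk.
second = ["zeta eta theta iota", "alpha beta gamma", "kappa lambda"]
prefix_hits = prefix_hit_tokens(first, second)
chunk_hits = chunk_hit_tokens(store, second)
print(prefix_hits, chunk_hits)  # 0 7
```

Because the second request starts with a different chunk, the prefix match covers nothing, while the per-chunk lookup finds 7 of the 9 prompt tokens already cached. The count is only an upper bound on what can be skipped: as the abstract notes, naive reuse of those cached KVs degrades output quality, so in practice some fraction of them would still need recomputation.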

Paper Structure

This paper contains 38 sections, 14 equations, 29 figures, 3 tables, 2 algorithms.

Figures (29)

  • Figure 1: Distribution of the number of tokens in the prefill (left) and decode (right) phases for two real production RAG systems, Sys-X and Sys-Y.
  • Figure 2: Prefill time across prefill length and batch size in vLLM on A100 80GB with TP=4.
  • Figure 3: Chunk-cache hit rate PDF for both Sys-X and RAG datasets.
  • Figure 4: Overview of Cache-Craft.
  • Figure 5: Token distribution of different prompt components (Mother prompt, RAG chunks, Examples, Query, etc.) across RAG use cases.
  • ...and 24 more figures