Generative Caching for Structurally Similar Prompts and Responses

Sarthak Chakraborty, Suman Nath, Xuchao Zhang, Chetan Bansal, Indranil Gupta

TL;DR

This work tackles the inefficiency of existing LLM caches in agentic and repetitive workflows by introducing GenCache, a variation-aware cache that learns and reuses structural patterns. GenCache groups structurally similar prompts into clusters, synthesizes one Python program per cluster to generate tailored responses, and validates each program before caching it. Empirical results show high cache hit rates and substantial cost and latency savings across synthetic and real-world datasets, with notable improvements in agentic workflows. The approach balances performance with correctness by generating responses locally from cached programs rather than returning static LLM outputs, which enables scalable reuse in structured tasks.

Abstract

Large Language Models (LLMs) are increasingly being used to plan, reason, and execute tasks across diverse scenarios. In use cases like repeatable workflows and agentic settings, prompts are often reused with minor variations while having a similar structure for recurring tasks. This opens up opportunities for caching. However, exact prompt matching fails on such structurally similar prompts, while semantic caching may produce incorrect responses by ignoring critical differences. To address this, we introduce GenCache, a generative cache that produces variation-aware responses for structurally similar prompts. GenCache identifies reusable response patterns across similar prompt structures and synthesizes customized outputs for new requests. We show that GenCache achieves an 83% cache hit rate, while having minimal incorrect hits on datasets without prompt repetition. In agentic workflows, it improves cache hit rate by approximately 20% and reduces end-to-end execution latency by approximately 34% compared to standard prompt matching.
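
To make the contrast in the abstract concrete, the following is a minimal sketch of why exact prompt matching misses on a structurally similar prompt while a cached program can still answer it. The prompts, the action-string format, and the names `exact_lookup` and `cached_program` are all hypothetical illustrations, not taken from the paper; in GenCache the program would be synthesized by an LLM rather than written by hand.

```python
import re

# Hypothetical prompts: same structure, different item name.
PROMPT_1 = "Add the item 'wireless mouse' to the shopping cart."
PROMPT_2 = "Add the item 'usb-c cable' to the shopping cart."

# Exact-match cache: keyed on the full prompt string (illustrative contents).
exact_cache = {
    PROMPT_1: "click(search_box); type('wireless mouse'); click(add_to_cart)",
}


def exact_lookup(prompt):
    """Return the stored response only if the prompt matches character for character."""
    return exact_cache.get(prompt)


def cached_program(prompt):
    """Hand-written stand-in for a cached program: extract the varying part
    of the prompt and fill it into the shared response pattern."""
    match = re.search(r"Add the item '([^']+)' to the shopping cart\.", prompt)
    if match is None:
        return None  # prompt does not fit this structure, so treat it as a miss
    item = match.group(1)
    return f"click(search_box); type('{item}'); click(add_to_cart)"


print(exact_lookup(PROMPT_2))    # miss: the prompt string differs
print(cached_program(PROMPT_2))  # response tailored to the new item
```

A semantic cache would sit between these two: it could match `PROMPT_2` to `PROMPT_1` by similarity, but it would return the stored response for the wrong item, which is the incorrect-hit case the abstract describes.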

Paper Structure

This paper contains 16 sections, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Comparison of GenCache with existing caching techniques. Exact prompt matching treats both instructions as distinct, so both result in cache misses and an LLM generates both responses. GPTCache registers a cache hit for Instruction 2 but incorrectly returns the response already saved for a similar prompt. GenCache, on the other hand, executes the cached program locally on a cache hit to generate the correct response tailored to the input.
  • Figure 2: GenCache workflow (solid lines: cache reuse, dotted lines: cache generation). The workflow panel illustrates the runtime path: for a new prompt $\mathcal{P}$, the system finds the nearest cluster based on a similarity threshold and checks whether a cached program is available for reuse. If a suitable cache is found, it generates the response $\mathcal{R}$ after passing sanity checks. If not, an LLM generates $\mathcal{R}$, and ($\mathcal{P}$, $\mathcal{R}$) is stored in the cluster database. Once enough example pairs accumulate in a cluster, CodeGenLLM attempts to generate a program and stores it in the cache store after validation. The CodeGenLLM panel shows the prompt for CodeGenLLM and the generated program.
  • Figure 3: Program validation process before a program is stored as cache. Step numbering is consistent with Figure 2, and only the later steps are shown here. After program generation, ValidLLM checks that the program-generated responses match the expected outputs, using in-context examples in the prompt. It produces a Boolean array recording the verdict for each example. If fewer than $\gamma$ responses match, the system retries cache generation using a reflection-based prompt. Otherwise, it stores the validated program as cache.
  • Figure 4: Ratio of the number of LLM calls used to create the cache to the number of cache hits, plotted against incoming prompts over time.
  • Figure 5: LLM token usage for caching compared against the 'Without Cache' baseline. With no prompt repetitions, ExactCache incurs the same cost as the baseline.
  • ...and 4 more figures
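
The runtime and validation steps described in the captions of Figures 2 and 3 can be sketched as a small loop. This is a rough illustration under stated assumptions, not the authors' implementation: `difflib.SequenceMatcher` stands in for whatever similarity measure the paper uses, the constants `SIM_THRESHOLD`, `MIN_EXAMPLES`, and `GAMMA` have invented values, exact string comparison stands in for ValidLLM, `call_llm` and `generate_program` are caller-supplied stand-ins for the LLM and CodeGenLLM, and the reflection-based retry is omitted.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import Callable, Optional

SIM_THRESHOLD = 0.8  # assumed similarity threshold for joining a cluster
MIN_EXAMPLES = 3     # assumed number of (prompt, response) pairs before codegen
GAMMA = 3            # assumed minimum number of matching responses


@dataclass
class Cluster:
    examples: list = field(default_factory=list)  # (prompt, response) pairs
    program: Optional[Callable[[str], Optional[str]]] = None  # validated cache


def similarity(a: str, b: str) -> float:
    # Stand-in similarity measure; the paper's actual measure may differ.
    return SequenceMatcher(None, a, b).ratio()


def nearest_cluster(clusters, prompt):
    """Return the most similar cluster, or None if none clears the threshold."""
    best, best_score = None, 0.0
    for cluster in clusters:
        score = similarity(cluster.examples[0][0], prompt)
        if score > best_score:
            best, best_score = cluster, score
    return best if best_score >= SIM_THRESHOLD else None


def validate(program, examples):
    """Stand-in for ValidLLM: one Boolean per example (exact string match)."""
    return [program(p) == r for p, r in examples]


def handle(prompt, clusters, call_llm, generate_program):
    cluster = nearest_cluster(clusters, prompt)

    # Cache reuse path: run the cached program locally.
    if cluster is not None and cluster.program is not None:
        response = cluster.program(prompt)
        if response is not None:  # simplified sanity check
            return response, "hit"

    # Cache miss path: ask the LLM and record the example pair.
    response = call_llm(prompt)
    if cluster is None:
        cluster = Cluster()
        clusters.append(cluster)
    cluster.examples.append((prompt, response))

    # Cache generation path: synthesize and validate once enough pairs exist.
    if cluster.program is None and len(cluster.examples) >= MIN_EXAMPLES:
        candidate = generate_program(cluster.examples)
        if sum(validate(candidate, cluster.examples)) >= GAMMA:
            cluster.program = candidate  # store only validated programs
    return response, "miss"
```

With a stub LLM and a stub code generator, the first few structurally similar prompts in a cluster would be misses that accumulate examples, and later prompts in the same cluster would be served by the stored program without an LLM call. A fuller version would also catch exceptions raised by the cached program and fall back to the LLM.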