Generative Caching for Structurally Similar Prompts and Responses
Sarthak Chakraborty, Suman Nath, Xuchao Zhang, Chetan Bansal, Indranil Gupta
TL;DR
This work tackles the inefficiency of existing LLM caches in agentic and repetitive workflows by introducing GenCache, a variation-aware cache that learns and reuses structural patterns. GenCache clusters structurally similar prompts, synthesizes a Python program per cluster to generate tailored responses, and validates each program before caching it. Empirical results show high cache hit rates and substantial cost and latency savings across synthetic and real-world datasets, with notable improvements in agentic workflows. Because responses are generated locally from cached programs rather than returned as static LLM outputs, the approach balances performance with correctness and enables scalable reuse in structured tasks.
Abstract
Large Language Models (LLMs) are increasingly being used to plan, reason, and execute tasks across diverse scenarios. In use cases such as repeatable workflows and agentic settings, prompts for recurring tasks are often reused with minor variations while keeping a similar structure. This opens up opportunities for caching. However, exact prompt matching fails on such structurally similar prompts, while semantic caching may produce incorrect responses by ignoring critical differences. To address this, we introduce GenCache, a generative cache that produces variation-aware responses for structurally similar prompts. GenCache identifies reusable response patterns across similar prompt structures and synthesizes customized outputs for new requests. We show that GenCache achieves an 83% cache hit rate with minimal incorrect hits on datasets without prompt repetition. In agentic workflows, it improves the cache hit rate by approximately 20% and reduces end-to-end execution latency by approximately 34% compared to standard prompt matching.
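To make the idea of a generative cache concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that a cluster of structurally similar prompts can be described by a regular expression and that the per-cluster program is a small Python function; both are hand-written here, whereas the system described above synthesizes programs automatically. All names (`CachedProgram`, `GenerativeCacheSketch`, `admit`, `lookup`) and the shopping-cart prompts are hypothetical.

```python
import json
import re
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class CachedProgram:
    # Hypothetical stand-in for one cluster: the regex captures what varies
    # between prompts, and generate() builds a response from that variation.
    pattern: re.Pattern
    generate: Callable[[re.Match], str]


@dataclass
class GenerativeCacheSketch:
    programs: list = field(default_factory=list)

    def admit(self, program: CachedProgram, examples) -> bool:
        """Cache a program only if it reproduces every recorded LLM response."""
        for prompt, expected in examples:
            match = program.pattern.fullmatch(prompt)
            if match is None or program.generate(match) != expected:
                return False  # validation failed: do not cache
        self.programs.append(program)
        return True

    def lookup(self, prompt: str) -> Optional[str]:
        """Return a locally generated response, or None on a cache miss."""
        for program in self.programs:
            match = program.pattern.fullmatch(prompt)
            if match is not None:
                return program.generate(match)
        return None  # the caller would fall back to the LLM here


# Assumed (prompt, LLM response) pairs observed for one recurring task.
examples = [
    ("Add blue socks to the cart", '{"action": "add_to_cart", "item": "blue socks"}'),
    ("Add a red mug to the cart", '{"action": "add_to_cart", "item": "a red mug"}'),
]

# Hand-written program standing in for a synthesized one.
program = CachedProgram(
    pattern=re.compile(r"Add (?P<item>.+) to the cart"),
    generate=lambda m: json.dumps({"action": "add_to_cart", "item": m["item"]}),
)

cache = GenerativeCacheSketch()
admitted = cache.admit(program, examples)
hit = cache.lookup("Add green tea to the cart")   # unseen prompt, same structure
miss = cache.lookup("Remove green tea from the cart")
```

In this sketch, a prompt that was never seen verbatim still hits the cache because the response is computed from the captured variation rather than copied from a stored answer. An exact-match cache would miss it, and a semantic cache could return the response for a different item.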
