Agentic Plan Caching: Test-Time Memory for Fast and Cost-Efficient LLM Agents

Qizheng Zhang, Michael Wornow, Gerry Wan, Kunle Olukotun

TL;DR

Agentic Plan Caching (APC) tackles the high cost and latency of Plan-Act LLM agents by creating a test-time memory that extracts structured plan templates from completed executions. It uses keyword extraction to match new tasks to cached plans and employs lightweight models to adapt templates to task-specific contexts, avoiding expensive re-planning when possible. Across five real workloads and two agent architectures, APC reduces serving costs by 50.31% and latency by 27.28% while preserving 96.61% of optimal performance, with only ~1% overhead for keyword extraction and cache generation. The approach complements existing LLM serving infrastructures and is robust to model variation, offering a practical pathway to more cost-effective agent-enabled AI systems.

Abstract

LLM-based agent applications have shown increasingly remarkable capabilities in complex workflows but incur substantial costs and latency due to extensive planning and reasoning requirements. Existing LLM caching techniques (like context caching and semantic caching), primarily designed for serving chatbots, are insufficient for agent applications where outputs depend on external data and environmental contexts. We propose Agentic Plan Caching (APC), a novel test-time memory that extracts, stores, adapts, and reuses structured plan templates from the planning stages of agent applications across semantically similar tasks to reduce the cost and latency of serving. Unlike traditional semantic caching, our system extracts plan templates from completed agent executions at test time, employs keyword extraction to match new requests against cached plans, and uses lightweight models to adapt these templates into task-specific plans that incorporate the current context. Evaluation across multiple real-world agent applications shows that our system can reduce costs by 50.31% and latency by 27.28% on average while maintaining performance, offering a more efficient solution for serving LLM-based agents that complements existing LLM serving infrastructures.
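The hit and miss paths described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: `PlanCache`, `serve`, and the three toy helpers are hypothetical names, the keyword extractor and both planners are plain string functions standing in for LLM calls, and exact keyword matching is an assumption made to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Plan = List[str]


@dataclass
class PlanCache:
    """Hypothetical cache mapping an extracted keyword to a plan template."""
    extract_keyword: Callable[[str], str]
    templates: Dict[str, Plan] = field(default_factory=dict)

    def lookup(self, task: str) -> Optional[Plan]:
        # Assumption: a hit requires an exact match on the extracted keyword.
        return self.templates.get(self.extract_keyword(task))

    def store(self, task: str, template: Plan) -> None:
        self.templates[self.extract_keyword(task)] = template


def serve(task: str, cache: PlanCache) -> Tuple[Plan, bool]:
    """Return (plan, cache_hit) for a task."""
    template = cache.lookup(task)
    if template is not None:
        # Cache hit: a lightweight step adapts the template to this task.
        return cheap_adapter(template, task), True
    # Cache miss: plan from scratch, then store a template for later reuse.
    plan = expensive_planner(task)
    cache.store(task, make_template(plan, task))
    return plan, False


# --- Toy stand-ins for the LLM components (all hypothetical) ---

def toy_keyword(task: str) -> str:
    # Stand-in for keyword extraction: keep the first two words.
    return " ".join(task.lower().split()[:2])


def expensive_planner(task: str) -> Plan:
    # Stand-in for the large planning model.
    return [f"retrieve data for '{task}'", "compute the metric", "report the answer"]


def make_template(plan: Plan, task: str) -> Plan:
    # Strip task-specific text so the plan can be reused.
    return [step.replace(task, "{task}") for step in plan]


def cheap_adapter(template: Plan, task: str) -> Plan:
    # Stand-in for the lightweight model that fills in task context.
    return [step.replace("{task}", task) for step in template]


cache = PlanCache(toy_keyword)
first_plan, first_hit = serve("Compute growth for Acme in 2021", cache)
second_plan, second_hit = serve("Compute growth for Globex in 2022", cache)
print(first_hit, second_hit)
print(second_plan[0])
```

In this toy run the first task misses and populates the cache, and the second task, which shares the extracted keyword, is served from the adapted template without calling the expensive planner again.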

Paper Structure

This paper contains 53 sections, 5 figures, 11 tables, and 3 algorithms.

Figures (5)

  • Figure 1: Plan-Act LLM Applications and Caching Techniques. (a) A typical Plan-Act agent pipeline loop and (b) a comparison of LLM caching methods, with cached components highlighted in yellow.
  • Figure 2: Agentic Plan Caching Framework. We show: (a) cache hit workflow, (b) cache miss workflow, and (c) plan template generation for new cache entries.
  • Figure 3: Query-Based vs. Keyword-Based Cache Search. Keyword-based cache search achieves lower false-positive and false-negative rates than query-based similarity search across different thresholds. This suggests that semantic similarity of queries alone may not effectively capture shared task intents and reusable plans.
  • Figure 4: Results across Four Baselines and Agentic Plan Caching.
  • Figure 5: Accuracy Comparison across Caching Methods. While semantic caching with threshold=0.9 in (a) and full-history caching in (b) experience notable accuracy drops during cache hits, agentic plan caching in (c) maintains stable performance across datasets.
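
The threshold comparison in Figure 3 depends on how false positives and false negatives are counted for a similarity-based cache. The sketch below shows one plausible way to compute both rates at a given threshold. The function name `error_rates`, the rule that a hit means `score >= threshold`, and the sample scores and labels are all assumptions made for illustration; none of them are taken from the paper.

```python
from typing import List, Tuple


def error_rates(scores: List[float], reusable: List[bool],
                threshold: float) -> Tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate).

    scores[i]   : similarity between a new request and its best cache entry
    reusable[i] : whether the cached plan was actually reusable (ground truth)
    A request counts as a cache hit when its score is >= threshold.
    """
    # False positive: declared a hit, but the plan was not reusable.
    fp = sum(1 for s, r in zip(scores, reusable) if s >= threshold and not r)
    # False negative: declared a miss, but the plan was reusable.
    fn = sum(1 for s, r in zip(scores, reusable) if s < threshold and r)
    negatives = sum(1 for r in reusable if not r)
    positives = sum(1 for r in reusable if r)
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return fpr, fnr


# Made-up scores and labels, for illustration only.
scores = [0.95, 0.85, 0.70, 0.60]
reusable = [True, False, True, False]

rates_at_08 = error_rates(scores, reusable, 0.8)
rates_at_09 = error_rates(scores, reusable, 0.9)
print(rates_at_08, rates_at_09)
```

On this made-up data, raising the threshold from 0.8 to 0.9 removes the false positive but leaves the reusable request scored 0.70 as a false negative, which is the kind of trade-off a threshold sweep exposes.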