Why Agent Caching Fails and How to Fix It: Structured Intent Canonicalization with Few-Shot Learning

Abhinaba Basu

TL;DR

This work shows that existing agent caching methods fail, introduces W5H2, a structured intent decomposition framework, and provides risk-controlled selective prediction guarantees via RCPS with nine bound families.

Abstract

Personal AI agents incur substantial cost via repeated LLM calls. We show that existing caching methods fail: GPTCache achieves 37.9% accuracy on real benchmarks; APC achieves 0-12%. The root cause is optimizing for the wrong property -- cache effectiveness requires key consistency and precision, not classification accuracy. We observe that cache-key evaluation reduces to clustering evaluation and apply V-measure decomposition to separate these two properties on n=8,682 points across MASSIVE, BANKING77, CLINC150, and NyayaBench v2, our new 8,514-entry multilingual agentic dataset (528 intents, 20 W5H2 classes, 63 languages). We introduce W5H2, a structured intent decomposition framework. Using SetFit with 8 examples per class, W5H2 achieves 91.1% ± 1.7% on MASSIVE in ~2ms -- vs 37.9% for GPTCache and 68.8% for a 20B-parameter LLM at 3,447ms. On NyayaBench v2 (20 classes), SetFit achieves 55.3%, with cross-lingual transfer across 30 languages. Our five-tier cascade handles 85% of interactions locally, projecting a 97.5% cost reduction. We provide risk-controlled selective prediction guarantees via RCPS with nine bound families.
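
To make the clustering view of cache-key evaluation concrete, here is a minimal sketch of the standard V-measure decomposition, treating true intents as classes and cache keys as clusters. Homogeneity is high when each key maps to a single intent (precision), and completeness is high when each intent maps to a single key (consistency). The function names and the toy labels are illustrative assumptions, not taken from the paper's code.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) for two aligned label sequences."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (g, _), c in joint.items())


def v_measure(intents, cache_keys):
    """Return (homogeneity, completeness, V) for cache keys vs. true intents."""
    h_intents = entropy(intents)
    h_keys = entropy(cache_keys)
    # Homogeneity: 1 - H(intent | key) / H(intent); defined as 1 when H(intent) = 0.
    hom = 1.0 if h_intents == 0 else 1.0 - conditional_entropy(intents, cache_keys) / h_intents
    # Completeness: 1 - H(key | intent) / H(key); defined as 1 when H(key) = 0.
    com = 1.0 if h_keys == 0 else 1.0 - conditional_entropy(cache_keys, intents) / h_keys
    v = 0.0 if hom + com == 0 else 2 * hom * com / (hom + com)
    return hom, com, v


# Toy data (hypothetical): keys are pure, but "check_email" is split over two keys.
intents = ["check_email", "check_email", "send_email", "send_email"]
keys = ["k1", "k2", "k3", "k3"]
hom, com, v = v_measure(intents, keys)
```

In this toy case every key holds a single intent, so homogeneity is 1, while the split of one intent across two keys lowers completeness to 2/3 and V to 0.8. A cache with that profile would never return a wrong plan but would miss reuse opportunities.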

Paper Structure

This paper contains 36 sections, 1 theorem, 5 equations, 10 figures, and 15 tables.

Key Result

Proposition 1

Given $n$ calibration examples, risk tolerance $\alpha \in (0,1)$, failure probability $\delta \in (0,1)$, and a grid of $K$ candidate thresholds, define where $\hat{R}(\tau) = |\{i : \mathrm{conf}(x_i) \geq \tau \wedge f(x_i) \neq \mathrm{intent}(x_i)\}| / n$ is the empirical marginal unsafe rate on the calibration set, and $C(n,K,\delta)$ is a finite-sample correction term. Then $\Pr\!\left(\ma
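
The statement above is truncated in this excerpt, so the exact threshold rule and the form of $C(n,K,\delta)$ are not visible here. As a rough sketch of how such a rule might be applied, the code below assumes a Hoeffding bound with a union bound over the $K$ grid points, $C(n,K,\delta) = \sqrt{\log(K/\delta)/(2n)}$, and picks the smallest threshold whose corrected unsafe rate stays within $\alpha$. Both the correction term and the selection rule are assumptions made for illustration; the paper mentions nine bound families and may use a different one.

```python
from math import log, sqrt


def empirical_unsafe_rate(confs, correct, tau):
    """Fraction of ALL calibration points that are accepted (conf >= tau) and wrong."""
    n = len(confs)
    return sum(1 for c, ok in zip(confs, correct) if c >= tau and not ok) / n


def hoeffding_correction(n, k, delta):
    """Assumed finite-sample term: Hoeffding with a union bound over k thresholds."""
    return sqrt(log(k / delta) / (2 * n))


def select_threshold(confs, correct, grid, alpha, delta):
    """Smallest tau in grid with unsafe rate + correction <= alpha, else None.

    The unsafe rate is non-increasing in tau, so the smallest passing
    threshold is also the one that accepts the most queries.
    """
    slack = hoeffding_correction(len(confs), len(grid), delta)
    for tau in sorted(grid):
        if empirical_unsafe_rate(confs, correct, tau) + slack <= alpha:
            return tau
    return None  # no threshold certifies the target risk; never reuse the cache


# Hypothetical calibration set: 900 confident correct, 100 unconfident wrong.
confs = [0.9] * 900 + [0.3] * 100
correct = [True] * 900 + [False] * 100
tau_hat = select_threshold(confs, correct, [0.1, 0.5, 0.95], alpha=0.1, delta=0.05)
```

Returning `None` when no grid point passes is a deliberate choice in this sketch: falling back to the LLM is safer than reusing a cached result without a certified bound.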

Figures (10)

  • Figure 1: Overview of W5H2 structured intent canonicalization for agent caching. (a) The embedding similarity trap: semantically similar queries ("check email" vs. "send email," cos$=0.91$) require different tool sequences, while paraphrases of the same intent (cos$=0.65$) fall below typical thresholds. (b) W5H2 decomposes queries into structured fields; the (What, Where) pair forms the cache key. (c) Five-tier cascade architecture with per-tier latency and traffic share. (d) SetFit accuracy on MASSIVE (8-class) across methods: 22M-parameter SetFit (91.1%) outperforms a 20B-parameter LLM (68.8%) at $>$700$\times$ lower latency.
  • Figure 2: Benchmark comparison. (a) Accuracy vs. number of classes (log scale). NyayaBench v2 (20 real agentic classes) is substantially harder than established benchmarks, reflecting the difficulty of real-world intent distributions. (b) V-measure decomposition (homogeneity $h$, completeness $c$, V-measure $V$) across all four benchmarks. SetFit maintains balanced $h$/$c$ even as class count increases.
  • Figure 3: Cross-lingual transfer on NyayaBench v2. (a) Per-language accuracy for 30 languages, colored by language family. Indo-European (Slavic, Romance) and Kartvelian languages transfer best; Dravidian and Niger-Congo remain challenging. (b) Language family summary with mean accuracy and range (whiskers). All training data is English-only (160 examples).
  • Figure 4: Information-theoretic analysis of cache-key quality across three benchmarks.
  • Figure 5: Cache-key quality decomposition. SetFit maximizes mutual information with intents while minimizing conditional entropy in both directions.
  • ...and 5 more figures
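
As an illustration of the idea in Figure 1(b), where the (What, Where) pair forms the cache key, the following sketch caches tool plans under a structured key instead of an embedding similarity score. The `IntentCache` class, the keyword-based `toy_canonicalize` function, and the planner stub are all hypothetical stand-ins; in the paper the canonicalization step is a SetFit classifier, which is not reproduced here.

```python
from typing import Callable, Dict, List, Tuple

CacheKey = Tuple[str, str]  # (what, where); field values here are illustrative


class IntentCache:
    """Caches tool plans keyed by a canonical (what, where) pair."""

    def __init__(self, canonicalize: Callable[[str], CacheKey]):
        self._canonicalize = canonicalize
        self._plans: Dict[CacheKey, List[str]] = {}
        self.hits = 0
        self.misses = 0

    def lookup(self, query: str, plan_fn: Callable[[str], List[str]]) -> List[str]:
        key = self._canonicalize(query)
        if key in self._plans:
            self.hits += 1
        else:
            # Cache miss: call the expensive planner (e.g. an LLM) once per key.
            self.misses += 1
            self._plans[key] = plan_fn(query)
        return self._plans[key]


def toy_canonicalize(query: str) -> CacheKey:
    """Keyword stand-in for a learned classifier (assumption, not the paper's model)."""
    q = query.lower()
    what = "send" if ("send" in q or "write" in q) else "check"
    where = "email" if ("mail" in q or "inbox" in q) else "other"
    return (what, where)


def stub_planner(query: str) -> List[str]:
    """Hypothetical planner that would normally be an LLM call."""
    what, where = toy_canonicalize(query)
    return [f"{where}.{what}"]


cache = IntentCache(toy_canonicalize)
plan_a = cache.lookup("check email", stub_planner)
plan_b = cache.lookup("Any new mail in my inbox?", stub_planner)  # paraphrase, same key
plan_c = cache.lookup("send email to Bob", stub_planner)          # similar text, different key
```

With this design, the paraphrase reuses the cached plan while "send email" gets its own entry, which is the behavior the embedding similarity trap in Figure 1(a) fails to deliver.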

Theorems & Definitions (1)

  • Proposition 1: Risk-controlled cache reuse