Asynchronous Verified Semantic Caching for Tiered LLM Architectures

Asmit Kumar Singh; Haozhe Wang; Laxmi Naga Santosh Attaluri; Tak Chiam; Weihua Zhu

TL;DR

This work tackles the latency and cost burden of large language model inference by enhancing semantic caching with a two-tier static-dynamic design. It introduces Krites, an asynchronous, LLM-judged caching policy that preserves the on-path static-threshold decisions while using an off-path verifier to promote validated static answers into the dynamic cache, via an auxiliary, safe upsert mechanism. Key contributions include a formal grey-zone trigger, a binary LLM judge for equivalence, and an auxiliary overwrite that expands curated static hits without increasing critical-path latency; evaluations on SemCacheLMArena and SemCacheSearchQueries show substantial gains in static-origin coverage (up to $136\%$ and $290\%$, respectively) at fixed error rates. The proposed approach enhances safety and reliability of enterprise-scale LLM deployments by unlocking the value of offline-curated answers for recurring and paraphrased queries, with ROI driven by judicious off-path judging and promotion strategies.

Abstract

Large language models (LLMs) now sit in the critical path of search, assistance, and agentic workflows, making semantic caching essential for reducing inference cost and latency. Production deployments typically use a tiered static-dynamic design: a static cache of curated, offline-vetted responses mined from logs, backed by a dynamic cache populated online. In practice, both tiers are commonly governed by a single embedding similarity threshold, which induces a hard tradeoff: conservative thresholds miss safe reuse opportunities, while aggressive thresholds risk serving semantically incorrect responses. We introduce Krites, an asynchronous, LLM-judged caching policy that expands static coverage without changing serving decisions. On the critical path, Krites behaves exactly like a standard static-threshold policy. When the nearest static neighbor of the prompt falls just below the static threshold, Krites asynchronously invokes an LLM judge to verify whether the static response is acceptable for the new prompt. Approved matches are promoted into the dynamic cache, allowing future repeats and paraphrases to reuse curated static answers and expanding static reach over time. In trace-driven simulations on conversational and search workloads, Krites increases the fraction of requests served with curated static answers (direct static hits plus verified promotions) by up to $3.9$ times for conversational traffic and search-style queries relative to tuned baselines, with unchanged critical-path latency.
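
The serving flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `KritesSketch`, the grey-zone lower bound `sigma_min`, the threshold values, the static-then-dynamic lookup order, and the in-memory lists are all assumptions made for illustration, and the off-path judge is modeled as a plain callable drained from a queue rather than a real asynchronous worker.

```python
import math
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List, Optional, Tuple

Vector = Tuple[float, ...]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class Entry:
    prompt: str
    embedding: Vector
    response: str
    static_origin: bool  # True if the response comes from the curated static tier


def nearest(entries: List[Entry], q: Vector) -> Tuple[Optional[Entry], float]:
    """Brute-force nearest neighbor; a real system would use a vector index."""
    best, best_sim = None, -1.0
    for e in entries:
        sim = cosine(e.embedding, q)
        if sim > best_sim:
            best, best_sim = e, sim
    return best, best_sim


@dataclass
class KritesSketch:
    static: List[Entry]
    judge: Callable[[str, str, str], bool]  # (new prompt, cached prompt, cached response)
    backend: Callable[[str], str]
    tau_static: float = 0.9   # assumed value
    tau_dynamic: float = 0.9  # assumed value
    sigma_min: float = 0.7    # hypothetical lower edge of the grey zone
    dynamic: List[Entry] = field(default_factory=list)
    pending: Deque[Tuple[str, Vector, Entry]] = field(default_factory=deque)

    def serve(self, prompt: str, emb: Vector) -> Tuple[str, str]:
        """On-path decision: identical to a plain threshold policy."""
        s_entry, s_sim = nearest(self.static, emb)
        if s_entry is not None and s_sim >= self.tau_static:
            return s_entry.response, "static"
        # Grey zone: enqueue a judge task; the current request is unaffected.
        if s_entry is not None and self.sigma_min <= s_sim < self.tau_static:
            self.pending.append((prompt, emb, s_entry))
        d_entry, d_sim = nearest(self.dynamic, emb)
        if d_entry is not None and d_sim >= self.tau_dynamic:
            return d_entry.response, "promoted" if d_entry.static_origin else "dynamic"
        response = self.backend(prompt)
        self.dynamic.append(Entry(prompt, emb, response, False))
        return response, "backend"

    def drain(self) -> int:
        """Off-path work: judge queued pairs and promote approved static answers."""
        promoted = 0
        while self.pending:
            prompt, emb, s_entry = self.pending.popleft()
            if self.judge(prompt, s_entry.prompt, s_entry.response):
                # Auxiliary overwrite: replace any dynamic entry for this prompt.
                self.dynamic = [e for e in self.dynamic if e.prompt != prompt]
                self.dynamic.append(Entry(prompt, emb, s_entry.response, True))
                promoted += 1
        return promoted
```

In this sketch the first request for a grey-zone prompt is still answered by the backend; only after `drain()` runs and the judge approves does a repeat of that prompt receive the curated static answer through the dynamic tier.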

Paper Structure (28 sections, 3 equations, 2 figures, 1 table, 2 algorithms)

Introduction
Background and system model
Vector embeddings and semantic search
Tiered static-dynamic architecture
Static tier $C_{\text{static}}$
Dynamic tier $C_{\text{dynamic}}$
Why tiering improves quality.
Agentic backend $B$
Baseline GPTCache style policy
Krites architecture
Grey-zone trigger and task scheduling
LLM judge requirements
Task.
Auxiliary overwrite semantics
Cost and capacity considerations
...and 13 more sections

Figures (2)

Figure 1: (a) Hit-rate composition for the baseline static-threshold policy versus Krites. The overall cache hit rate (total bar height) remains identical. However, Krites significantly increases the proportion of hits served directly from the offline-vetted, curated static tier (darker bottom section), strictly improving the safety and quality of served responses. (b) Krites couples the static and dynamic tiers via an off-path LLM judge and auxiliary overwrite, building on the static-dynamic architectures used in web search [Fagni2006Caching, BaezaYates2008Design, Mele2020Topical].
Figure 2: Static-origin served fraction (including auxiliary-overwrite promotions) as a function of requests processed, starting from a cold dynamic cache, for (a) SemCacheLMArena and (b) SemCacheSearchQueries. Krites increases static-origin coverage over time by populating the dynamic tier with verified pointers to static answers via auxiliary overwrites.
