Asynchronous Verified Semantic Caching for Tiered LLM Architectures

Asmit Kumar Singh, Haozhe Wang, Laxmi Naga Santosh Attaluri, Tak Chiam, Weihua Zhu

TL;DR

This work tackles the latency and cost burden of large language model inference by enhancing semantic caching in a two-tier static-dynamic design. It introduces Krites, an asynchronous, LLM-judged caching policy that preserves on-path static-threshold decisions while using an off-path verifier to promote validated static answers into the dynamic cache through an auxiliary overwrite (a safe upsert). Key contributions include a formal grey-zone trigger, a binary LLM judge for equivalence, and an auxiliary overwrite that expands curated static hits without increasing critical-path latency. Evaluations on SemCacheLMArena and SemCacheSearchQueries show substantial gains in static-origin coverage (up to 136% and 290%, respectively) at fixed error rates. The approach improves the safety and reliability of enterprise-scale LLM deployments by unlocking the value of offline-curated answers for recurring and paraphrased queries, with ROI driven by judicious off-path judging and promotion strategies.

Abstract

Large language models (LLMs) now sit in the critical path of search, assistance, and agentic workflows, making semantic caching essential for reducing inference cost and latency. Production deployments typically use a tiered static-dynamic design: a static cache of curated, offline-vetted responses mined from logs, backed by a dynamic cache populated online. In practice, both tiers are commonly governed by a single embedding similarity threshold, which induces a hard tradeoff: conservative thresholds miss safe reuse opportunities, while aggressive thresholds risk serving semantically incorrect responses. We introduce Krites, an asynchronous, LLM-judged caching policy that expands static coverage without changing serving decisions. On the critical path, Krites behaves exactly like a standard static threshold policy. When the nearest static neighbor of the prompt falls just below the static threshold, Krites asynchronously invokes an LLM judge to verify whether the static response is acceptable for the new prompt. Approved matches are promoted into the dynamic cache, allowing future repeats and paraphrases to reuse curated static answers and expanding static reach over time. In trace-driven simulations on conversational and search workloads, Krites increases the fraction of requests served with curated static answers (direct static hits plus verified promotions) by up to 3.9 times for conversational traffic and search-style queries relative to tuned baselines, with unchanged critical path latency.
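
The policy described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `KritesSketch`, the callables `embed`, `backend`, and `judge`, the parameter names `tau_static`, `tau_dynamic`, and `sigma_min` (a hypothetical lower bound for the grey zone), the brute-force nearest-neighbor search, and the choice to skip judging when a promoted entry was just served are all assumptions made for illustration. A plain queue drained by `run_verifier` stands in for the asynchronous judge.

```python
import math
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class Entry:
    prompt: str
    embedding: Sequence[float]
    response: str
    static_origin: bool = False  # True when the response comes from the static tier


def nearest(entries: List[Entry], q: Sequence[float]) -> Tuple[Optional[Entry], float]:
    """Brute-force nearest neighbor by cosine similarity (illustrative only)."""
    best, best_sim = None, -1.0
    for e in entries:
        sim = cosine(e.embedding, q)
        if sim > best_sim:
            best, best_sim = e, sim
    return best, best_sim


class KritesSketch:
    """Hypothetical two-tier cache with an off-path judge (assumed interface)."""

    def __init__(self, embed: Callable[[str], Sequence[float]],
                 backend: Callable[[str], str],
                 judge: Callable[[str, str, str], bool],
                 static_entries: List[Entry],
                 tau_static: float, tau_dynamic: float, sigma_min: float):
        self.embed, self.backend, self.judge = embed, backend, judge
        self.static = list(static_entries)
        self.dynamic: List[Entry] = []
        self.tau_static, self.tau_dynamic, self.sigma_min = tau_static, tau_dynamic, sigma_min
        # Verification jobs waiting for the off-path judge.
        self.pending: Deque[Tuple[str, Sequence[float], Entry]] = deque()

    def serve(self, prompt: str) -> Tuple[str, str]:
        """On-path decision: identical to a plain static-threshold policy."""
        q = self.embed(prompt)
        s_hit, s_sim = nearest(self.static, q)
        if s_hit is not None and s_sim >= self.tau_static:
            return s_hit.response, "static"
        d_hit, d_sim = nearest(self.dynamic, q)
        if d_hit is not None and d_sim >= self.tau_dynamic:
            origin = "promoted" if d_hit.static_origin else "dynamic"
            response = d_hit.response
        else:
            response, origin = self.backend(prompt), "miss"
            self.dynamic.append(Entry(prompt, q, response))
        # Grey zone: static neighbor just below the threshold. Only enqueue work;
        # the response above is returned without waiting for the judge.
        in_grey_zone = s_hit is not None and self.sigma_min <= s_sim < self.tau_static
        if in_grey_zone and origin != "promoted":
            self.pending.append((prompt, q, s_hit))
        return response, origin

    def run_verifier(self) -> None:
        """Off-path step: judge queued pairs and promote approved static answers."""
        while self.pending:
            prompt, q, s_hit = self.pending.popleft()
            if self.judge(prompt, s_hit.prompt, s_hit.response):
                self._upsert(Entry(prompt, q, s_hit.response, static_origin=True))

    def _upsert(self, entry: Entry) -> None:
        """Auxiliary overwrite: replace the dynamic entry for this prompt, or add one."""
        for i, existing in enumerate(self.dynamic):
            if existing.prompt == entry.prompt:
                self.dynamic[i] = entry
                return
        self.dynamic.append(entry)
```

In this sketch `serve` never calls the judge, so the serving decision and its latency match the baseline; only a later call to `run_verifier` changes what the dynamic tier holds, and a repeat of the same prompt is then answered with the curated static response.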

Paper Structure

This paper contains 28 sections, 3 equations, 2 figures, 1 table, and 2 algorithms.

Figures (2)

  • Figure 1: (a) Hit rate composition for the baseline static threshold policy versus Krites. The overall cache hit rate (total bar height) remains identical. However, Krites significantly increases the proportion of hits served directly from the offline-vetted, curated static tier (darker bottom section), strictly improving the safety and quality of served responses. (b) Krites couples the static and dynamic tiers via an off-path LLM judge and auxiliary overwrite, building on the static-dynamic architectures used in web search [Fagni2006Caching, BaezaYates2008Design, Mele2020Topical].
  • Figure 2: Static-origin served fraction (including auxiliary-overwrite promotions) as a function of requests processed, starting from a cold dynamic cache, for (a) SemCacheLMArena and (b) SemCacheSearchQueries. Krites increases static-origin coverage over time by populating the dynamic tier with verified pointers to static answers via auxiliary overwrites.
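
The quantity plotted in Figure 2 can be read as a running fraction over the request stream. The Python sketch below shows one way to compute such a curve from a per-request log. The function name and the outcome labels (`"static"`, `"promoted"`, `"dynamic"`, `"miss"`) are hypothetical and do not come from the paper's evaluation code.

```python
from typing import Iterable, List

# Outcomes counted as static-origin: direct static hits plus verified promotions.
STATIC_ORIGIN = {"static", "promoted"}


def static_origin_curve(origins: Iterable[str]) -> List[float]:
    """Fraction of requests served with static-origin answers after each request."""
    curve: List[float] = []
    served = 0
    for n, origin in enumerate(origins, start=1):
        if origin in STATIC_ORIGIN:
            served += 1
        curve.append(served / n)
    return curve
```

With a cold dynamic cache the early points reflect only direct static hits; promotions can raise the curve only after the off-path judge has approved entries and later requests reuse them.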