RAC: Relation-Aware Cache Replacement for Large Language Models

Yuchong Wu, Zihuan Xu, Wangze Ni, Peng Cheng, Lei Chen, Xuemin Lin, Heng Tao Shen, Kui Ren

TL;DR

Relation-Aware Cache (RAC) is an online eviction strategy that uses semantic relations among requests to guide eviction decisions. It remains effective across diverse workloads, consistently surpassing state-of-the-art baselines by 20%--30% in cache hit ratio.

Abstract

The scaling of Large Language Model (LLM) services faces significant cost and latency challenges, making effective caching under tight capacity crucial. Existing cache replacement policies, from heuristics to learning-based methods, predominantly rely on limited-window statistics such as recency and frequency. We show these signals are not robust for real-world LLM workloads, which exhibit long reuse distances and sparse local recurrence. To address these limitations, we propose Relation-Aware Cache (RAC), an online eviction strategy that leverages semantic relations among requests to guide eviction decisions. RAC synthesizes two relation-aware signals: (1) Topical Prevalence, which aggregates access evidence at the topic level to capture long-horizon reuse; and (2) Structural Importance, which leverages local intra-topic dependency structure to discriminate entries by their future reuse value. Extensive evaluations show that RAC maintains high effectiveness across diverse workloads, consistently surpassing state-of-the-art baselines by 20%--30% in cache hit ratio.
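To make the two signals concrete, here is a minimal Python sketch of an eviction rule that combines a topic-level access count with a per-entry importance weight. It is an illustration under stated assumptions, not the paper's scoring function: the class name `RelationAwareCacheSketch`, the product `topic_hits * importance`, and the caller-supplied `topic` and `importance` arguments are all hypothetical, since the abstract does not say how RAC assigns topics or combines its signals.

```python
from collections import defaultdict


class RelationAwareCacheSketch:
    """Toy cache that evicts the entry with the lowest relation-aware score.

    Assumption: score = (accesses seen for the entry's topic) * (entry importance).
    The abstract does not give RAC's actual formula; this is a stand-in.
    """

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.entries = {}                   # key -> (topic, importance)
        self.topic_hits = defaultdict(int)  # topic -> access count (prevalence proxy)

    def _score(self, key):
        topic, importance = self.entries[key]
        return self.topic_hits[topic] * importance

    def access(self, key, topic, importance=1.0):
        """Record an access; return True on a hit, False on a miss."""
        # Evidence is aggregated per topic, so it outlives any single entry.
        self.topic_hits[topic] += 1
        if key in self.entries:
            return True
        if len(self.entries) >= self.capacity:
            victim = min(self.entries, key=self._score)
            del self.entries[victim]
        self.entries[key] = (topic, importance)
        return False


cache = RelationAwareCacheSketch(capacity=2)
cache.access("q1", topic="cooking", importance=2.0)
cache.access("q2", topic="travel")
cache.access("q3", topic="cooking")  # cache is full: "q2" has the lowest score
```

In this run a recency-based policy such as LRU would evict `q1`, the oldest entry, whereas the sketch evicts `q2` because its topic has less accumulated evidence and its importance weight is lower.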

Paper Structure

This paper contains 19 sections, 1 theorem, 18 equations, 7 figures, 1 table, 5 algorithms.

Key Result

Proposition 1

For any $\beta\in(0,1)$, the uniform-restart walk is irreducible and aperiodic. Hence the stationary distribution in eq:rw_restart_uniform exists and is unique. Moreover, it can be computed by power iteration: starting from any distribution $r^{(0)}$ over $V$, repeatedly applying the update in eq:rw
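The power iteration mentioned in the proposition can be sketched in Python as follows. The referenced equations are not reproduced in this summary, so the update used here is the standard uniform-restart form, taken as an assumption: with probability $\beta$ the walk follows a uniformly chosen out-edge, and with probability $1-\beta$ it restarts at a uniformly chosen node of $V$. The function name `restart_walk_stationary`, the adjacency-dict input format, and the handling of nodes without out-edges are hypothetical choices made for illustration.

```python
def restart_walk_stationary(adj, beta=0.85, start=None, tol=1e-12, max_iter=10_000):
    """Power iteration for a random walk with uniform restart.

    adj:   dict mapping every node to a list of its out-neighbours
           (every neighbour must also be a key of adj).
    beta:  assumed probability of following an edge; 1 - beta is the
           probability of restarting at a uniformly chosen node.
    start: optional initial distribution r^(0); defaults to uniform.
    """
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    nodes = sorted(adj)
    n = len(nodes)
    r = dict(start) if start is not None else {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Restart mass reaches every node, which makes the chain
        # irreducible and aperiodic.
        nxt = {v: (1.0 - beta) / n for v in nodes}
        for u in nodes:
            out = adj[u]
            # Assumption: a node with no out-edges spreads its mass uniformly.
            targets = out if out else nodes
            share = beta * r[u] / len(targets)
            for v in targets:
                nxt[v] += share
        delta = sum(abs(nxt[v] - r[v]) for v in nodes)
        r = nxt
        if delta < tol:
            break
    return r


graph = {"a": ["b"], "b": ["a"], "c": ["a"]}
r_uniform = restart_walk_stationary(graph, beta=0.5)
r_skewed = restart_walk_stationary(graph, beta=0.5, start={"a": 0.0, "b": 0.0, "c": 1.0})
```

For this three-node graph with `beta=0.5`, solving the fixed-point equations by hand gives 4/9 for `a`, 7/18 for `b`, and 1/6 for `c`. Both starting distributions converge to those values, which is the uniqueness the proposition states.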

Figures (7)

  • Figure 1: Demonstration of traditional, learning-based, and offline-optimal policies on Example 1.
  • Figure 2: Hit ratio on simulated sequences under two stress axes: (a) varying long reuse-distance ratio; (b) varying long-tail coefficient.
  • Figure 3: Normalized hit ratio on timestamp-continuous OASST1 sub-traces under different cache capacities.
  • Figure 5: RQ4: Parameter sensitivity at 10% cache capacity.
  • Figure : (a) Performance vs. capacity
  • ...and 2 more figures

Theorems & Definitions (1)

  • Proposition 1: Existence, uniqueness, and computability