AdaptCache: KV Cache Native Storage Hierarchy for Low-Delay and High-Quality Language Model Serving

Shaoting Feng, Hanchen Li, Kuntai Du, Zhuohan Gu, Yuhan Liu, Jiayi Yao, Siddhant Ray, Samuel Shen, Yihua Cheng, Ganesh Ananthanarayanan, Junchen Jiang

TL;DR

AdaptCache tackles the growing size of KV caches in LLM serving with a lossy KV cache storage system over a DRAM-SSD hierarchy. It comprises an offline estimator, a policy optimizer, and an executor that together select, for each cache entry, the compression algorithm, compression rate, and device placement. The selection is guided by the utility $\mathrm{Utility}(i) = \mathrm{Freq}(i)\cdot\left(\alpha\cdot\mathrm{Quality}(i, M_i, R_i) - \frac{\mathrm{Size}(i, M_i, R_i)}{\mathrm{Bandwidth}}\right)$, where $M_i$ and $R_i$ denote the compression method and rate chosen for entry $i$, and is solved with a greedy approximation to the NP-hard Multi-Choice Knapsack Problem based on marginal utility drops. Evaluated on LongBench with Llama-3.1-8B-Instruct, AdaptCache achieves 1.43–2.4× delay savings at the same quality and 6–55% quality improvements at the same delay across three tasks, outperforming fixed compression baselines and offloading policies. The approach enables higher DRAM hit rates, lower loading delays, and scalable KV cache management without significant degradation in generation quality, offering a practical path to efficient LLM serving in datacenters. Overall, AdaptCache demonstrates how device-aware lossy compression and per-entry optimization can substantially improve latency and throughput in real-time LLM inference.
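To make the utility and the marginal-utility-drop idea concrete, here is a minimal Python sketch. It is not the authors' implementation: it assumes a single device (DRAM) with a fixed capacity and ignores the SSD tier, and the names `Option`, `utility`, and `greedy_fit`, along with all numbers, are hypothetical. Each entry starts uncompressed, and the loop repeatedly applies the compression step that loses the least utility per GB freed until everything fits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    """One hypothetical (method, rate) choice for a KV cache entry."""
    method: str
    quality: float  # estimated generation quality with this option
    size_gb: float  # size of the entry under this option


def utility(freq, quality, size_gb, bandwidth_gbps, alpha):
    # Utility(i) = Freq(i) * (alpha * Quality - Size / Bandwidth)
    return freq * (alpha * quality - size_gb / bandwidth_gbps)


def greedy_fit(entries, capacity_gb, bandwidth_gbps, alpha):
    """entries maps name -> (freq, options), options ordered from least to
    most compressed. Returns name -> index of the chosen option."""
    choice = {name: 0 for name in entries}  # start with the least compressed

    def u(name, idx):
        freq, opts = entries[name]
        return utility(freq, opts[idx].quality, opts[idx].size_gb,
                       bandwidth_gbps, alpha)

    def total_size():
        return sum(entries[n][1][i].size_gb for n, i in choice.items())

    while total_size() > capacity_gb:
        best = None  # (utility drop per GB freed, entry name)
        for name, idx in choice.items():
            opts = entries[name][1]
            if idx + 1 >= len(opts):
                continue  # already at the most compressed option
            freed = opts[idx].size_gb - opts[idx + 1].size_gb
            if freed <= 0:
                continue  # this step frees no space
            ratio = (u(name, idx) - u(name, idx + 1)) / freed
            if best is None or ratio < best[0]:
                best = (ratio, name)
        if best is None:
            raise ValueError("entries cannot fit even at maximum compression")
        choice[best[1]] += 1  # take the cheapest compression step
    return choice
```

With two entries of equal size but different access frequency, this sketch compresses the rarely used entry first, since its utility drop per GB freed is smaller. A fuller version would also have to choose between DRAM and SSD for each entry, with the bandwidth term depending on that placement.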

Abstract

Large language model (LLM) applications often reuse previously processed context, such as chat history and documents, which introduces significant redundant computation. Existing LLM serving systems address such redundant computation by storing the KV caches of processed context and loading the corresponding KV cache when a new request reuses the context. Further, as these LLM applications scale, the total size of KV caches becomes excessively large and requires both DRAM and SSD for full storage. However, prior work that stores KV caches in DRAM and SSD suffers from high loading delays, as most KV cache hits come from SSD, which is slow to load. To increase the KV cache hit rate on DRAM, we identify lossy KV cache compression as a promising approach. We design a lossy compression system that decides the compression algorithm, compression rate and device placement for each KV cache entry to maximise DRAM hits and minimise loading delay without significantly degrading generation quality. Compared to various static compression baselines across three tasks, our system AdaptCache achieves 1.43--2.4$\times$ delay savings at the same quality and 6--55% quality improvements at the same delay.
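The reason DRAM hits matter can be seen with a little arithmetic. The sketch below is an illustrative model, not taken from the paper: it assumes every reused context is found in either DRAM or SSD, that loading time is simply size divided by device bandwidth, and that the bandwidths and hit rates are made-up values. The function name `expected_load_delay` is hypothetical.

```python
def expected_load_delay(size_gb, dram_hit_rate, dram_gbps, ssd_gbps):
    """Expected seconds to load one KV cache entry, assuming each request
    hits DRAM with probability dram_hit_rate and SSD otherwise."""
    dram_delay = size_gb / dram_gbps
    ssd_delay = size_gb / ssd_gbps
    return dram_hit_rate * dram_delay + (1.0 - dram_hit_rate) * ssd_delay


# Illustrative numbers only: a 2 GB entry, DRAM at 20 GB/s, SSD at 2 GB/s.
baseline = expected_load_delay(2.0, dram_hit_rate=0.25,
                               dram_gbps=20.0, ssd_gbps=2.0)

# Suppose lossy compression halves the entry size, so more entries fit in
# DRAM and the DRAM hit rate rises (the new hit rate is also an assumption).
compressed = expected_load_delay(1.0, dram_hit_rate=0.75,
                                 dram_gbps=20.0, ssd_gbps=2.0)

print(f"baseline: {baseline:.4f} s, compressed: {compressed:.4f} s")
```

Under these assumed numbers, compression helps twice: each load moves fewer bytes, and a larger share of loads is served from the faster device.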

Paper Structure

This paper contains 3 sections, 2 equations, 2 figures.

Figures (2)

  • Figure 1: AdaptCache increases cache hits on the high-speed device and reduces loading time from the low-speed device while maintaining high generation quality.
  • Figure 2: AdaptCache achieves 1.43--2.4$\times$ lower TTFT and 6--55% higher quality compared with a fixed compression method and rate.