Adaptive Multi-Objective Tiered Storage Configuration for KV Cache in LLM Service

Xianzhe Zheng, Zhengheng Wang, Ruiyan Ma, Rui Wang, Xiyu Wang, Rui Chen, Peng Zhang, Sicheng Pan, Zhangheng Huang, Chenxin Wu, Yi Zhang, Bo Cai, Kan Liu, Teng Ma, Yin Du, Dong Deng, Sai Wu, Guoyun Zhu, Wei Zhang, Feifei Li

TL;DR

This paper introduces Kareto, a KV-cache Adaptive REsource managemenT Optimizer that uses a diminishing-return-guided pruning method to efficiently navigate the large configuration space and approximate the Pareto frontier.

Abstract

The memory-for-computation paradigm of KV caching is essential for accelerating large language model (LLM) inference services, but limited GPU high-bandwidth memory (HBM) capacity motivates offloading the KV cache to cheaper external storage tiers. While this expands capacity, it introduces the challenge of dynamically managing heterogeneous storage resources to balance cost, throughput, and latency under varying workloads. We formulate this as a multi-objective optimization problem: identifying the Pareto frontier across these metrics within the storage configuration space. Using a high-fidelity end-to-end simulator, we observe that the objective functions are non-analytic and exhibit complex variable coupling, making the Pareto frontier difficult to approximate analytically. To obtain the frontier, we introduce Kareto, a KV-cache Adaptive REsource managemenT Optimizer. Kareto leverages a diminishing-return-guided pruning method to efficiently navigate the large configuration space and approximate the Pareto frontier. Additionally, it incorporates a fine-grained adaptive tuner that uses eviction policies in tiered storage and KV block access patterns for group-specific cache management, improving cache efficiency. Experiments on real-world traces show that Kareto adapts to the workload and can identify configurations with better cost efficiency, covering static strategies. Compared to a fixed setup with 1024 GB DRAM, Kareto can improve throughput by up to 9.3%, reduce latency by up to 58.3%, or lower cost by up to 20.2%, depending on the optimization objective.
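
To make the multi-objective formulation concrete, the sketch below filters a handful of candidate storage configurations down to their Pareto frontier over cost (lower is better), throughput (higher is better), and latency (lower is better). It is only an illustration of Pareto dominance, not Kareto's pruning method; the `Config` type, the configuration names, and all numbers are hypothetical and do not come from the paper.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    """A hypothetical storage configuration with its measured objectives."""
    name: str
    cost: float        # lower is better
    throughput: float  # higher is better
    latency: float     # lower is better


def dominates(a: Config, b: Config) -> bool:
    """Return True if `a` is no worse than `b` on every objective
    and strictly better on at least one."""
    no_worse = (
        a.cost <= b.cost
        and a.throughput >= b.throughput
        and a.latency <= b.latency
    )
    strictly_better = (
        a.cost < b.cost
        or a.throughput > b.throughput
        or a.latency < b.latency
    )
    return no_worse and strictly_better


def pareto_frontier(configs: List[Config]) -> List[Config]:
    """Keep every configuration that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Made-up numbers, chosen only to illustrate the trade-offs.
candidates = [
    Config("dram_only", cost=10.0, throughput=100.0, latency=0.50),
    Config("disk_only", cost=4.0, throughput=90.0, latency=0.90),
    Config("hybrid", cost=6.0, throughput=100.0, latency=0.55),
    Config("oversized", cost=12.0, throughput=100.0, latency=0.50),
]

frontier = pareto_frontier(candidates)
for cfg in frontier:
    print(cfg)
```

In this toy data, `oversized` is dropped because `dram_only` matches its throughput and latency at lower cost; the other three each win on at least one objective and therefore stay on the frontier. An exhaustive pairwise check like this is quadratic in the number of candidates, which is one reason a large configuration space calls for pruning before evaluation.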

Paper Structure

This paper contains 22 sections, 6 equations, 13 figures, and 2 algorithms.

Figures (13)

  • Figure 4: Design of simulator.
  • Figure 5: Impact of DRAM capacity on reuse ratio, throughput, and mean TTFT.
  • Figure 6: Reuse ratio of DRAM and disk as a function of capacity under different workloads. (a) DRAM, traceA, ins1 (high-density workload). (b) Disk, traceA, ins1 (high-density workload). (c) DRAM, traceA, ins4 (low-density workload). (d) Disk, traceA, ins4 (low-density workload).
  • Figure 7: Disk bandwidth and reuse ratio at different disk capacities.
  • Figure 8: Performance comparison of three resource allocation strategies under 2 instances: pure DRAM (blue solid), pure DISK (orange dashed), and hybrid with 256 GB DRAM + varying DISK (dark blue dash-dot). The hybrid strategy achieves a balance between cost and latency while maintaining high throughput.
  • ...and 8 more figures