ARKV: Adaptive and Resource-Efficient KV Cache Management under Limited Memory Budget for Long-Context Inference in LLMs

Jianlong Lei, Shashikant Ilager

TL;DR

This paper proposes ARKV, a lightweight and adaptive framework that dynamically allocates precision levels to cached tokens based on per-layer attention dynamics and token-level importance, and demonstrates the practical viability of ARKV for scalable LLM deployment, offering fine-grained, data-driven memory control.

Abstract

Large Language Models (LLMs) are increasingly deployed in scenarios demanding ultra-long context reasoning, such as agentic workflows and deep research understanding. However, long-context inference is constrained by the KV cache, a transient memory structure that grows linearly with sequence length and batch size, quickly dominating GPU memory usage. Existing memory reduction techniques, including eviction and quantization, often rely on static heuristics and suffer from degraded quality under tight budgets. In this paper, we propose ARKV, a lightweight and adaptive framework that dynamically allocates precision levels to cached tokens based on per-layer attention dynamics and token-level importance. During a short prefill phase, ARKV estimates the Original–Quantization (OQ) ratio of each layer by computing statistical scores such as attention entropy, variance, and kurtosis. During decoding, tokens are assigned to one of three states: Original (full precision), Quantization (low precision), or Eviction, according to a fast heavy-hitter scoring strategy. Our experiments on LLaMA3 and Qwen3 models across diverse long- and short-context tasks demonstrate that ARKV preserves ~97% of baseline accuracy on long-context benchmarks while reducing KV memory usage by 4x, with minimal throughput loss. On short-context tasks, ARKV matches full-precision baselines; on GSM8K math reasoning, it significantly outperforms uniform quantization. These results highlight the practical viability of ARKV for scalable LLM deployment, offering fine-grained, data-driven memory control without retraining or architectural modifications. The source code and artifacts are available at: https://github.com/Large-scale-Sustainable-Computing-LSC/ARKV
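
The abstract names attention entropy, variance, and kurtosis as the per-layer statistics gathered during prefill. The following is a minimal sketch of how such statistics could be computed from one attention matrix with NumPy. The function name `attention_stats`, the averaging over query rows, and the `eps` smoothing term are assumptions made for illustration; how ARKV combines these scores into an OQ ratio is not stated here, so the sketch stops at the raw statistics.

```python
import numpy as np


def attention_stats(attn, eps=1e-12):
    """Summarize one attention matrix (hypothetical helper, not ARKV's API).

    attn: array of shape (num_queries, num_keys); each row is a softmax
    distribution over keys and is assumed to sum to 1.
    Returns the mean entropy, variance, and kurtosis across query rows.
    """
    attn = np.asarray(attn, dtype=np.float64)

    # Shannon entropy per query row; eps avoids log(0).
    entropy = -(attn * np.log(attn + eps)).sum(axis=-1)

    # Variance and (non-excess) kurtosis of the attention weights per row.
    centered = attn - attn.mean(axis=-1, keepdims=True)
    variance = (centered ** 2).mean(axis=-1)
    kurtosis = (centered ** 4).mean(axis=-1) / (variance ** 2 + eps)

    return {
        "entropy": float(entropy.mean()),
        "variance": float(variance.mean()),
        "kurtosis": float(kurtosis.mean()),
    }


# A flat row spreads attention evenly; a peaked row concentrates it on one key.
flat = attention_stats([[0.25, 0.25, 0.25, 0.25]])
peaked = attention_stats([[0.97, 0.01, 0.01, 0.01]])
print(flat)
print(peaked)
```

In this sketch, a peaked attention row yields lower entropy and higher variance than a flat one, which is the kind of signal a per-layer precision allocator could use to tell layers with a few dominant tokens apart from layers with diffuse attention.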

Paper Structure

This paper contains 23 sections, 10 equations, 5 figures, 7 tables, and 1 algorithm.

Figures (5)

  • Figure 1: Overview of the ARKV framework. During the Prefill Phase, attention statistics are used to compute an Original–Quantization (OQ) ratio and allocate per-layer cache budgets. In the Decoding Phase, tokens are scored by importance and dynamically assigned to Original, Quantization, or Eviction states. The reconstructed KV cache feeds into the attention mechanism with mixed-precision handling to balance memory efficiency and model fidelity.
  • Figure 2: LongBench performance across different models and KV cache strategies. The top panel shows absolute scores, while the bottom panel presents performance relative to the full-cache Base model (normalized to 1.0). Each group includes results for Origin (green), Quant (orange), and ARKV (blue), evaluated at token budgets of 512, 1024, and 2048 (from light to dark). ARKV consistently outperforms quantization-only baselines and closely matches origin-only performance, even under tight memory budgets.
  • Figure 3: GSM8K performance under various KV cache strategies and budgets across different models. The top panel shows absolute accuracy scores, while the bottom panel presents performance relative to the full-cache Base model (normalized to 1.0). Green bars represent origin-only retention, orange bars indicate uniform quantization, and blue bars denote ARKV. Each budget level (512, 1024, 2048 tokens) is visualized from light to dark. ARKV consistently preserves high accuracy across all models, whereas quantization-only methods degrade sharply, especially for small budgets and smaller models.
  • Figure 4: Relative throughput (TPS) across different models under various KV cache strategies and budgets. Each group shows performance of Base, Origin, Quant, and ARKV at budgets of 512, 1024, and 2048 tokens (from light to dark shades). Green bars represent full-precision eviction (Origin), orange bars show uniform quantization (Quant), and blue bars indicate ARKV. ARKV consistently retains high throughput ($\sim$85–88%) across models while enforcing cache constraints.
  • Figure 5: Histograms of FP8 quantization ratios (top) and eviction ratios (bottom) under ARKV. Dashed lines indicate the mean and peak values. Quantization ratios cluster around $\sim$0.14, while eviction ratios vary widely and are strongly influenced by the KV cache budget.
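
The decoding-phase assignment described in Figure 1 (tokens scored by importance, then placed in the Original, Quantization, or Eviction state) can be illustrated with a simplified sketch. Everything specific below is an assumption for illustration: the function `assign_states`, the use of a single per-token score such as accumulated attention, a `budget` counted in tokens, and a `quant_ratio` giving the share of retained tokens stored in low precision. The sketch also ignores that quantized tokens occupy fewer bytes than full-precision ones, which a real budget calculation would account for.

```python
import numpy as np

# Hypothetical state labels for this sketch.
ORIGINAL, QUANTIZED, EVICTED = 0, 1, 2


def assign_states(scores, budget, quant_ratio):
    """Assign each cached token to one of three states by importance rank.

    scores: 1-D array of per-token importance (e.g. accumulated attention).
    budget: maximum number of tokens retained in the cache (assumption).
    quant_ratio: fraction of retained tokens kept in low precision (assumption).
    """
    scores = np.asarray(scores, dtype=np.float64)
    n = scores.shape[0]

    keep = min(budget, n)
    n_quant = int(round(keep * quant_ratio))
    n_orig = keep - n_quant

    # Rank tokens from most to least important; stable sort keeps ties in order.
    order = np.argsort(-scores, kind="stable")

    states = np.full(n, EVICTED, dtype=np.int64)
    states[order[:n_orig]] = ORIGINAL        # heaviest hitters stay full precision
    states[order[n_orig:keep]] = QUANTIZED   # mid-importance tokens are quantized
    return states


scores = [0.40, 0.05, 0.20, 0.10, 0.02, 0.23]
print(assign_states(scores, budget=4, quant_ratio=0.25))
```

With a budget of four tokens and a quarter of them quantized, the three highest-scoring tokens stay in full precision, the fourth is quantized, and the remaining two are evicted.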