Inference-Time Hyper-Scaling with KV Cache Compression

Adrian Łańcucki, Konrad Staniszewski, Piotr Nawrot, Edoardo M. Ponti

TL;DR

This work tackles the KV-cache bottleneck in Transformer LLMs by introducing inference-time hyper-scaling through KV-cache compression. It presents Dynamic Memory Sparsification (DMS), a trainable, retrofitted eviction-based method that delays eviction and attains up to 8× compression with lightweight training. Across multiple model families and reasoning benchmarks, DMS yields superior accuracy under comparable memory and latency budgets, often outperforming training-free baselines and learned compression methods. The approach offers a practical path to upgrade existing LLMs into more capable reasoners under fixed compute constraints, with demonstrated gains on math, science, and coding tasks and strong throughput benefits.

Abstract

Inference-time scaling trades efficiency for increased reasoning accuracy by generating longer or more parallel sequences. However, in Transformer LLMs, generation cost is bottlenecked by the size of the key-value (KV) cache, rather than the number of generated tokens. Hence, we explore inference-time hyper-scaling: by compressing the KV cache, we can generate more tokens within the same compute budget and further improve the accuracy of scaled inference. The success of this approach, however, hinges on the ability of compression methods to preserve accuracy even at high compression ratios. To make hyper-scaling practical, we introduce Dynamic Memory Sparsification (DMS), a novel method for sparsifying KV caches that only requires 1K training steps to achieve 8$\times$ compression, while maintaining better accuracy than training-free sparse attention. Instead of prematurely discarding cached tokens, DMS delays token eviction, implicitly merging representations and preserving critical information. We demonstrate the effectiveness of inference-time hyper-scaling with DMS on multiple families of LLMs, showing that it boosts accuracy for comparable inference latency and memory load. For instance, we enhance Qwen-R1 32B by 12.0 points on AIME 24, 8.6 on GPQA, and 9.7 on LiveCodeBench on average for an equivalent number of memory reads.
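The budget argument above can be made concrete with a rough back-of-the-envelope model. The sketch below is an illustration under stated assumptions, not the paper's accounting: it assumes that at decoding step t the cache holds about t / CR tokens, ignores the prompt and any sliding window, and uses a hypothetical helper named `kv_reads`.

```python
def kv_reads(width: int, length: int, compression_ratio: float = 1.0) -> float:
    """Approximate total KV-cache token reads (simplified, assumed model).

    width: number of parallel reasoning threads (W)
    length: generated tokens per thread (L)
    compression_ratio: assumed KV-cache compression ratio (CR)
    """
    # Assumption: step t reads about t / CR cached tokens, so one thread
    # reads sum_{t=1..L} t / CR = L * (L + 1) / (2 * CR) tokens in total.
    per_thread = length * (length + 1) / (2.0 * compression_ratio)
    return width * per_thread


if __name__ == "__main__":
    baseline = kv_reads(width=1, length=32_000)
    compressed = kv_reads(width=8, length=32_000, compression_ratio=8.0)
    # Under this model, 8 threads at 8x compression cost as many reads
    # as a single uncompressed thread of the same length.
    print(baseline == compressed)
```

Under these assumptions, an 8× compression ratio frees enough read budget for roughly 8 times as many threads, or for sequences about √8 times longer, which is the sense in which compression lets a model generate more tokens for the same budget.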

Paper Structure

This paper contains 42 sections, 6 equations, 9 figures, 12 tables.

Figures (9)

  • Figure 1: Average absolute gains of DMS over the original LLMs during inference-time scaling on reasoning tasks for the same number of KV cache memory reads (a proxy for latency).
  • Figure 2: During each inference step (left), the incoming key-value pair $(\mathbf{k}_t, \mathbf{v}_t)$ might be selected for later eviction, based on predicted binary decisions $\alpha^{\text{bin}}\in\{0,1\}$ (we show only a sequence of keys for clarity). The eviction takes place as soon as the pair falls out of the sliding window. During training (right), this behavior is induced with an additive attention mask. Eviction decisions are relaxed from binary to continuous $\alpha \in [0,1]$.
  • Figure 3: Inference-time scaling results comparing exact-match accuracy ($y$-axis) against performance metrics ($x$-axis). Point colors indicate the compression algorithm used, shapes the compression ratio, and W–L labels denote the scaling strategy (W: number of sampled reasoning threads; L: sequence length). Colored lines indicate the respective Pareto frontiers. The horizontal black lines mark the accuracy reported by Guo et al. (2025) for the 1–32K vanilla model. Top: A comparison in terms of KV-cache token reads, used as an implementation-agnostic proxy for attention compute. Middle: A comparison in terms of the peak number of tokens in memory, reflecting memory load. Bottom: Throughput calculated at the maximum batch size that accommodates the corresponding W–L configuration. Across plots, DMS attains the best Pareto frontiers, indicating that KV-cache compression is an effective strategy for improving inference-time scaling.
  • Figure 4: Latency of models ($y$-axis) at different context lengths ($x$-axis). Top: We compare the effect of different model sizes (Qwen-R1 1.5B, 7B, 32B) for the same batch size (32). Bottom: We compare the effect of different batch sizes (32, 64, 128) for the same model (Qwen3-8B). Batch size reflects both the number of parallel reasoning threads and the number of queries the model is serving. These plots show that inference becomes memory-bound at different context lengths depending on model scale and batch size, yielding distinct accuracy-efficiency trade-offs.
  • Figure 5: GSM8K 0-shot scores of Llama 3.2 1B Instruct across different compression variants. Left: delayed eviction (default) with a 16-token window consistently preserves reasoning abilities of the model, while immediate eviction causes rapid degradation. The quality gap only widens as the compression gets stronger. Right: DMS requires an order of magnitude less data to train than DMC. This was also observed for Qwen 2.5 R1 models with 1.5B, 7B, and 32B parameter scales.
  • ...and 4 more figures
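
The inference-time behavior described in the Figure 2 caption can be sketched as a small simulation. This is a minimal illustration, not the authors' implementation: the function name `simulate_delayed_eviction`, the list-based cache, and the convention that the window covers the most recent `window` positions are all assumptions, and the binary decisions are supplied by hand rather than predicted by a model.

```python
from typing import List


def simulate_delayed_eviction(decisions: List[int], window: int) -> List[List[int]]:
    """Track which token positions remain cached after each decoding step.

    decisions[p] == 1 marks position p for later eviction (hypothetical
    stand-in for the predicted binary decision); 0 keeps it permanently.
    window: assumed sliding-window size; a marked position p is evicted
    once t - p >= window, i.e. as soon as it falls out of the window.
    """
    cache: List[int] = []
    history: List[List[int]] = []
    for t in range(len(decisions)):
        # The incoming pair is always cached first, even if it is marked.
        cache.append(t)
        # Drop marked positions that have just left the sliding window.
        cache = [p for p in cache if not (decisions[p] == 1 and t - p >= window)]
        history.append(list(cache))
    return history


if __name__ == "__main__":
    steps = simulate_delayed_eviction([1, 0, 1, 1, 0, 0], window=2)
    for t, cached in enumerate(steps):
        print(t, cached)
```

In this toy run, position 0 is marked at step 0 but stays readable until step 2, so later tokens can still attend to it before it disappears. Setting `window=0` in the sketch drops marked positions at once, which corresponds to the immediate-eviction variant that the Figure 5 caption reports as degrading rapidly.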