SWAN: Sparse Winnowed Attention for Reduced Inference Memory via Decompression-Free KV-Cache Compression

Santhosh G S, Saurav Prakash, Balaraman Ravindran

TL;DR

SWAN addresses the KV-cache memory bottleneck in autoregressive LLMs with a decompression-free framework that rotates and prunes KV vectors in an offline-learned subspace and then uses the sparse cache directly during attention. The method combines an offline SVD-derived projection per layer with a runtime absorption for V and O, keeps the QK projections RoPE-aware and applies them at decode time, and retains a dense 128-token buffer of recent tokens. The approach achieves 50-60% per-token memory savings with robust performance across reasoning and long-context benchmarks, and introduces a tunable compression parameter that allows dynamic memory-accuracy trade-offs. The key theoretical insight is a break-even point for computational savings given by $L > \frac{d_h^2}{d_h - k_{active}} + b$, which guides deployment for long sequences; the method is validated in practice on Llama-3.1 and OLMoE architectures.
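As a quick worked example of the break-even inequality above, the following sketch evaluates the right-hand side for a few settings. The function name and the chosen values of $d_h$, $k_{active}$, and $b$ are illustrative assumptions, not figures reported in the paper.

```python
def break_even_length(d_h: int, k_active: int, buffer_tokens: int) -> float:
    """Right-hand side of L > d_h^2 / (d_h - k_active) + b.

    Sequences longer than the returned value satisfy the break-even
    inequality quoted in the TL;DR. The function name is hypothetical.
    """
    if not 0 <= k_active < d_h:
        raise ValueError("k_active must satisfy 0 <= k_active < d_h")
    return d_h ** 2 / (d_h - k_active) + buffer_tokens


# Illustrative settings (assumptions, not the paper's experimental values).
print(break_even_length(d_h=128, k_active=64, buffer_tokens=128))  # 384.0
print(break_even_length(d_h=128, k_active=96, buffer_tokens=128))  # 640.0
```

Under this formula, retaining more dimensions (a larger $k_{active}$) pushes the break-even length up, so the computational benefit appears only at longer contexts.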

Abstract

Large Language Models (LLMs) face a significant bottleneck during autoregressive inference due to the massive memory footprint of the Key-Value (KV) cache. Existing compression techniques like token eviction, quantization, or other low-rank methods often risk information loss, have fixed limits, or introduce significant computational overhead from explicit decompression steps. In this work, we introduce SWAN, a novel, fine-tuning-free framework that eliminates this overhead. Our method uses an offline orthogonal matrix to rotate and prune the KV-cache, which is then used directly in the attention computation without any reconstruction. Our extensive experiments demonstrate that SWAN, augmented with a small dense buffer, offers a robust trade-off, maintaining performance close to the uncompressed baseline even at aggressive 50-60% memory savings per-token on KV-cache. A key advantage is its runtime-tunable compression level, allowing operators to dynamically adjust the memory footprint, a flexibility absent in methods requiring fixed offline configurations. This combination of a decompression-free design, high performance under compression, and adaptability makes SWAN a practical and efficient solution for serving LLMs with long contexts.
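To make the per-token savings figure concrete, the footprint of a pruned vector can be estimated as its retained values plus the indices needed to locate them. The sketch below assumes an 8-bit index per retained entry and a 16-bit dense baseline; the index width is not stated in this summary, but under that assumption the arithmetic reproduces the 0.66 retention threshold quoted in the Figure 2 caption. All function names are hypothetical.

```python
def sparse_to_dense_ratio(retention: float, value_bits: int,
                          index_bits: int = 8, dense_bits: int = 16) -> float:
    """Size of a pruned vector relative to its dense original.

    retention  : fraction of dimensions kept after pruning
    value_bits : bits per stored value (16, or 8 if quantized)
    index_bits : bits per stored index (ASSUMED to be 8 here)
    dense_bits : bits per value in the uncompressed cache
    """
    return retention * (value_bits + index_bits) / dense_bits


def break_even_retention(value_bits: int, index_bits: int = 8,
                         dense_bits: int = 16) -> float:
    """Retention ratio below which the sparse form is smaller than dense."""
    return dense_bits / (value_bits + index_bits)


print(sparse_to_dense_ratio(0.5, value_bits=16))  # 0.75
print(sparse_to_dense_ratio(0.5, value_bits=8))   # 0.5
print(round(break_even_retention(16), 2))         # 0.67
print(break_even_retention(8))                    # 1.0
```

With 8-bit values the ratio equals the retention ratio itself, which matches the "almost one-to-one" behaviour the caption describes.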

Paper Structure

This paper contains 29 sections, 5 theorems, 7 equations, 6 figures, 3 tables, 1 algorithm.

Key Result

Lemma A.1

Let $P_{QK} \in \mathbb{R}^{d_h \times d_h}$ be an orthogonal projection matrix derived from offline calibration. Let the projected query and key vectors be $\hat{q}_{i+1} = q_{i+1}P_{QK}$ and $\widehat{K}_{cache} = K_{cache}P_{QK}$. The attention scores computed using the original vectors ($S$) are
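Going by the lemma's title (Rotational Invariance of Attention Scores), the property at stake is that multiplying both queries and keys by the same orthogonal matrix leaves their dot products unchanged. The sketch below checks this numerically in plain Python. A Householder reflection stands in for $P_{QK}$ purely as a convenient orthogonal matrix; it is not the SVD-derived matrix from the paper, and the $1/\sqrt{d_h}$ scaling and softmax are omitted.

```python
import random


def householder(v):
    """Orthogonal reflection H = I - 2 v v^T / (v^T v), as a list of rows."""
    n = len(v)
    norm_sq = sum(x * x for x in v)
    return [[(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / norm_sq
             for j in range(n)] for i in range(n)]


def vec_mat(x, m):
    """Row vector times matrix: x @ m."""
    return [sum(x[i] * m[i][j] for i in range(len(x)))
            for j in range(len(m[0]))]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


random.seed(0)
d_h = 8  # small head dimension, chosen only for illustration
p_qk = householder([random.gauss(0, 1) for _ in range(d_h)])
q = [random.gauss(0, 1) for _ in range(d_h)]
k_cache = [[random.gauss(0, 1) for _ in range(d_h)] for _ in range(5)]

# Scores in the original space.
scores = [dot(q, k) for k in k_cache]

# Scores after rotating the query and every cached key by the same matrix.
q_hat = vec_mat(q, p_qk)
k_hat = [vec_mat(k, p_qk) for k in k_cache]
scores_hat = [dot(q_hat, k) for k in k_hat]

max_diff = max(abs(a - b) for a, b in zip(scores, scores_hat))
print(max_diff < 1e-9)  # True
```

The equality holds because $(qP)(kP)^\top = q P P^\top k^\top = q k^\top$ whenever $P P^\top = I$; any approximation error in SWAN therefore comes from pruning, not from the rotation.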

Figures (6)

  • Figure 1: An illustration of the SWAN attention mechanism during a single autoregressive decoding step for token $i+1$. The process begins with the input $x_{i+1}$, where the query ($q_{i+1}$) and key ($k_{i+1}$) are projected at runtime by the orthogonal matrix $P_{QK}$ to produce their rotated counterparts, $\hat{q}_{i+1}$ and $\hat{k}_{i+1}$. The value vector is generated directly in the rotated space as $\hat{v}_{i+1}$ using the pre-modified weight matrix $\widehat{W}_V$. The core of our method is the hybrid KV-cache, composed of two parts: (2) a Sparse KV-Cache storing pruned historical vectors, and a small, dense Buffer KV-Cache for recent vectors. As new vectors ($\hat{k}_{i+1}, \hat{v}_{i+1}$) enter the buffer, the oldest buffer vector is pruned based on magnitude ('arg top-k') and moved to the sparse cache. The final attention output ($\tilde{o}_{i+1}$) is computed using the rotated query $\hat{q}_{i+1}$ and the (3) Effective KV-Cache, which is the combination of both sparse and buffer caches, thus avoiding any decompression overhead.
  • Figure 2: (a) Relationship between pruning ratio (dimensions retained) and effective memory compression. The shaded region indicates where the sparse representation is larger than the dense original. For 16-bit values, savings begin only when the retention ratio is below 0.66; this threshold is significantly lower when using 8-bit quantized values and is almost one-to-one. (b) Performance of SWAN variants of Llama-3.1-8B-Instruct on the GSM8K reasoning benchmark. The buffered SWAN variants ('bt=128') demonstrate strong resilience, significantly outperforming the zero-buffer versions.
  • Figure 3: Performance on key NLP benchmarks for Llama-3.1-8B-Instruct (top) and OLMoE-1B-7B-Instruct (bottom). The buffered SWAN ('bt=64') maintains high performance even at significant compression ratios. Note the consistently smaller performance drop on the sparser OLMoE model, highlighting SWAN's ability to exploit the inherent sparsity in model architectures.
  • Figure 4: SWAN's performance on the LongBench suite. The figure highlights the performance across two summarization tasks, Multi-News and SAMSum, as well as the average performance across the MultiNews, LCC, SAMSum, Multi-News, and TREC tasks. The 128-token buffer ('bt=128') is critical, preventing the catastrophic failure seen in the zero-buffer versions. Note the graceful degradation of the buffered variants and the strong performance of the 8-bit version on summarization, even under aggressive compression.
  • Figure 5: Detailed performance on additional NLP tasks for Llama-3.1-8B-Instruct (top row) and OLMoE-1B-7B-0924-Instruct (bottom row). The figure displays results for Winogrande, HellaSwag, TruthfulQA MC2, and WikiText. The trends confirm the critical role of the dense buffer in preserving performance across different task types and model architectures.
  • ...and 1 more figure
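The hybrid cache update described in the Figure 1 caption (new vectors enter a dense buffer, the oldest buffer entry is pruned by magnitude and moved to the sparse cache, and attention reads both parts without reconstruction) can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the class and function names are hypothetical, the same `k_active` is assumed for keys and values, and the inputs are taken to be already rotated.

```python
from collections import deque


def prune_top_k(vec, k):
    """Keep the k largest-magnitude entries as (index, value) pairs."""
    top = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k]
    return [(i, vec[i]) for i in sorted(top)]


def sparse_dot(q, sparse_vec):
    """Dot product against a pruned vector without rebuilding the dense form."""
    return sum(q[i] * v for i, v in sparse_vec)


class HybridKVCache:
    """Dense buffer for recent tokens plus a pruned cache for older ones."""

    def __init__(self, buffer_size, k_active):
        self.buffer_size = buffer_size
        self.k_active = k_active
        self.buffer = deque()  # recent (key, value) pairs, stored densely
        self.sparse = []       # older (key, value) pairs, stored pruned

    def append(self, k_hat, v_hat):
        self.buffer.append((k_hat, v_hat))
        if len(self.buffer) > self.buffer_size:
            # Evict the oldest buffered pair and prune it by magnitude.
            old_k, old_v = self.buffer.popleft()
            self.sparse.append((prune_top_k(old_k, self.k_active),
                                prune_top_k(old_v, self.k_active)))

    def scores(self, q_hat):
        """Unnormalized scores: sparse entries first, then buffer entries."""
        old = [sparse_dot(q_hat, k) for k, _ in self.sparse]
        recent = [sum(a * b for a, b in zip(q_hat, k)) for k, _ in self.buffer]
        return old + recent


cache = HybridKVCache(buffer_size=2, k_active=2)
cache.append([0.1, -3.0, 0.2, 2.0], [1.0, 0.0, 0.0, 0.5])
cache.append([1.0, 1.0, 1.0, 1.0], [0.0, 1.0, 0.0, 0.0])
cache.append([0.5, 0.5, 0.5, 0.5], [0.0, 0.0, 1.0, 0.0])
print(cache.sparse[0][0])                  # [(1, -3.0), (3, 2.0)]
print(cache.scores([1.0, 1.0, 1.0, 1.0]))  # [-1.0, 4.0, 2.0]
```

Because `k_active` is read at eviction time rather than fixed in the stored format, it could in principle be changed between decoding steps, which is one way to picture the runtime-tunable compression level the abstract mentions.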

Theorems & Definitions (10)

  • Lemma A.1: Rotational Invariance of Attention Scores
  • proof
  • Lemma A.2: Losslessness of Full Attention with Absorbed Weights
  • proof
  • Proposition A.3: Complexity of Standard Attention
  • proof
  • Proposition A.4: Complexity of SWAN
  • proof
  • Proposition A.5: Computational Break-Even Point
  • proof