Lexico: Extreme KV Cache Compression via Sparse Coding over Universal Dictionaries

Junhyuck Kim, Jongho Park, Jaewoong Cho, Dimitris Papailiopoulos

TL;DR

This work tackles the KV cache memory bottleneck in autoregressive transformers by introducing Lexico, a sparse dictionary-based compression framework. It learns layer-specific key and value dictionaries that are universal (input-agnostic), each with about $N=4096$ atoms, and uses orthogonal matching pursuit to represent keys and values with sparsity $s$, enabling reconstruction via $\hat{\bm{K}} = {\bm{K}}_{\text{csr}} {\bm{D}}_k^T$ and $\hat{\bm{V}} = {\bm{V}}_{\text{csr}} {\bm{D}}_v^T$, while keeping a small full-precision buffer for the most recent tokens. Lexico supports flexible memory-accuracy trade-offs, retaining 90–95% of full-cache GSM8K performance while using 15–25% of the full KV cache memory, and outperforms both eviction and quantization baselines in ultra-low-memory regimes; adaptive dictionary learning and error-thresholding further improve performance at the cost of extra memory. The approach yields strong memory savings for long-context tasks, and because its dictionaries are input-agnostic it can be applied off the shelf across diverse models and prompts, which makes it practical for memory-constrained LLM inference.
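
As a rough back-of-envelope check on these memory figures, assume FP16 storage for the uncompressed cache, a head dimension of $d = 128$ (as in Llama-3.1-8B), and the 16-bit index plus 8-bit value per nonzero coefficient mentioned in the Figure 4 caption. Under those assumptions a vector with sparsity $s$ costs $24s$ bits against $16d$ bits uncompressed:

$$\frac{24\,s}{16\,d} = \frac{3s}{2d}, \qquad s = 16,\ d = 128 \ \Rightarrow\ \frac{48}{256} \approx 18.8\%.$$

This estimate ignores the full-precision buffer, the dictionaries themselves, and any row-offset overhead of the sparse format, so it only indicates the order of magnitude of the savings rather than the paper's exact accounting.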

Abstract

We introduce Lexico, a novel KV cache compression method that leverages sparse coding with a universal dictionary. Our key finding is that key-value cache in modern LLMs can be accurately approximated using sparse linear combination from a small, input-agnostic dictionary of ~4k atoms, enabling efficient compression across different input prompts, tasks and models. Using orthogonal matching pursuit for sparse approximation, Lexico achieves flexible compression ratios through direct sparsity control. On GSM8K, across multiple model families (Mistral, Llama 3, Qwen2.5), Lexico maintains 90-95% of the original performance while using only 15-25% of the full KV-cache memory, outperforming both quantization and token eviction methods. Notably, Lexico remains effective in low memory regimes where 2-bit quantization fails, achieving up to 1.7x better compression on LongBench and GSM8K while maintaining high accuracy.
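
To make the sparse approximation step concrete, here is a minimal NumPy sketch of greedy orthogonal matching pursuit. It is not the authors' implementation: the function name `omp`, the toy sizes, and the random (untrained) dictionary are assumptions chosen for illustration, and the dictionary is stored as a $d \times N$ matrix so that it matches $\hat{\bm{K}} = {\bm{K}}_{\text{csr}} {\bm{D}}_k^T$.

```python
import numpy as np

def omp(D, x, s):
    """Approximate x with at most s columns (atoms) of D.

    D: (d, N) dictionary with unit-norm columns; x: (d,) vector.
    Returns (indices, coefficients) of the selected atoms.
    """
    residual = x.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(s):
        # Pick the atom most correlated with the current residual.
        scores = np.abs(D.T @ residual)
        if support:
            scores[support] = -np.inf   # never reselect an atom
        support.append(int(np.argmax(scores)))
        # Refit all selected coefficients by least squares.
        A = D[:, support]
        coef, *_ = np.linalg.lstsq(A, x, rcond=None)
        residual = x - A @ coef
    return np.array(support), coef

# Toy example with a random dictionary (hypothetical sizes, not the paper's).
rng = np.random.default_rng(0)
d, N = 16, 64
D = rng.standard_normal((d, N))
D /= np.linalg.norm(D, axis=0)          # unit-norm atoms
x = rng.standard_normal(d)

for s in (2, 4, 8):
    idx, coef = omp(D, x, s)
    rel_err = np.linalg.norm(x - D[:, idx] @ coef) / np.linalg.norm(x)
    print(f"s={s}: relative reconstruction error {rel_err:.3f}")
```

Raising `s` trades memory for reconstruction accuracy, which is the sense in which sparsity gives direct control over the compression ratio. With a random dictionary the errors are much larger than a trained dictionary would give.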

Paper Structure

This paper contains 31 sections, 7 equations, 7 figures, 10 tables, 1 algorithm.

Figures (7)

  • Figure 1: Memory usage vs. performance of Lexico compared to other cache compression methods on GSM8K. The figure illustrates the relationship between cache size and the performance of Lexico on Llama models on GSM8K 5-shot evaluation. For Lexico, we use a dictionary size of $N = 4096$ atoms and keep the last $128$ tokens in full-precision (buffer size $n_b=128$). Lexico consistently outperforms both eviction-based methods (SnapKV, PyramidKV) and quantization-based methods (per-token quantization, KIVI, ZipCache).
  • Figure 2: (a) Prefilling: Following attention computation, Lexico uses OMP to find sparse representations of the vectors ($3\text{-}8\times$ smaller). (b) Decoding: The key cache consists of the compressed sparse key cache, ${\bm{K}}_{\text{csr}}$, and a full-precision buffer, ${\bm{K}}_{\text{buffer}}$, for the most recent tokens. ${\bm{q}}_t$ and ${\bm{k}}_t$ represent the query and key vectors for the newly generated token. Computation is reduced by first computing the query-dictionary product, ${\bm{q}}_t {\bm{D}}_k$, and then multiplying by ${\bm{K}}_{\text{csr}}$ to get the pre-softmax attention scores.
  • Figure 3: Left shows a pairwise cosine similarity matrix between key vectors generated from one input text from all heads in Layer 10 of Llama-3.1-8B-Instruct. Keys are sorted by similarity to demonstrate the clusters. Right shows the similarity matrix between key vectors from two different input texts. These plots indicate that there may exist a mixture of low-dimensional subspaces in the space of all possible keys, a hypothesis that naturally leads to dictionary learning.
  • Figure 4: Dictionary Learning of Lexico. We train a linear layer ${\bm{D}}$ (our dictionary) that minimizes the $\ell_2$ reconstruction error of the cache. The cache of layer $i$ is used as training data for dictionary ${\bm{D}}^{(i)}$. At each step, we apply OMP with ${\bm{D}}$ fixed to represent each cached vector as a vector of sparse coefficients; we then perform a step of gradient descent on ${\bm{D}}$ and repeat the process. A sparse vector can be efficiently stored in CSR format, using a tuple of a 16-bit index and an 8-bit value for each nonzero coefficient.
  • Figure 5: Memory usage vs. performance of Qwen2.5-14B-Instruct with Lexico on GSM8K. We compare the performance of Lexico against quantization methods on Qwen2.5-14B-Instruct, with its weights quantized to 4 bits. For Lexico, we use $N = 4096$ as the dictionary size and $n_b = 128$ as the buffer size.
  • ...and 2 more figures
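
The decoding shortcut described in the Figure 2 caption can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's kernel: the sizes, the random dictionary and codes, and the function names are hypothetical, each cached key is assumed to keep exactly `s` (index, value) pairs, and values stay in float32 because NumPy has no 8-bit float type.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, T, s = 32, 256, 10, 4              # toy sizes; the paper uses N = 4096

D_k = rng.standard_normal((d, N))
D_k /= np.linalg.norm(D_k, axis=0)       # unit-norm atoms as columns

# Sparse codes for T cached keys: s atom indices and s coefficients per row.
idx = np.stack([rng.choice(N, size=s, replace=False) for _ in range(T)])
idx = idx.astype(np.uint16)              # 16 bits suffice for N <= 65536
val = rng.standard_normal((T, s)).astype(np.float32)

K_buffer = rng.standard_normal((3, d))   # recent tokens kept uncompressed
q = rng.standard_normal(d)

def scores_via_dictionary(q, D, idx, val):
    """Compute q D once, then gather s entries per cached key."""
    qD = q @ D                           # shape (N,)
    return (val * qD[idx]).sum(axis=1)   # shape (T,)

def scores_via_reconstruction(q, D, idx, val):
    """Reference path: densify K_csr, rebuild K_hat = K_csr D^T, then dot with q."""
    K_csr = np.zeros((idx.shape[0], D.shape[1]))
    np.put_along_axis(K_csr, idx.astype(np.int64), val, axis=1)
    K_hat = K_csr @ D.T                  # shape (T, d)
    return K_hat @ q

# Pre-softmax scores over compressed keys followed by buffered keys.
scores = np.concatenate([scores_via_dictionary(q, D_k, idx, val), K_buffer @ q])
print(scores.shape)
```

Both functions compute the same scores because ${\bm{q}}_t \hat{\bm{K}}^T = ({\bm{q}}_t {\bm{D}}_k)\,{\bm{K}}_{\text{csr}}^T$; the first avoids materializing $\hat{\bm{K}}$ and touches only `s` entries of ${\bm{q}}_t {\bm{D}}_k$ per cached key.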