Lexico: Extreme KV Cache Compression via Sparse Coding over Universal Dictionaries
Junhyuck Kim, Jongho Park, Jaewoong Cho, Dimitris Papailiopoulos
TL;DR
This work tackles the KV cache memory bottleneck in autoregressive transformers by introducing Lexico, a compression framework based on sparse coding over learned dictionaries. It learns layer-specific universal (input-agnostic) dictionaries of about $N=4096$ atoms and uses orthogonal matching pursuit to represent each key and value vector with at most $s$ nonzero coefficients. The cache is then reconstructed as $\hat{\bm{K}} = {\bm{K}}_{\text{csr}} {\bm{D}}_k^T$ and $\hat{\bm{V}} = {\bm{V}}_{\text{csr}} {\bm{D}}_v^T$, while a small full-precision buffer holds the most recent tokens. Lexico supports flexible memory-accuracy trade-offs: it retains 90–95% of the original GSM8K performance while using only 15–25% of the full KV cache memory, and it outperforms both eviction and quantization baselines in ultra-low-memory regimes. Adaptive dictionary learning and error thresholding further improve performance at the cost of extra memory. Because the dictionaries are input-agnostic, they can be applied off the shelf across prompts and tasks, which makes the approach practical for memory-constrained LLM inference, particularly on long-context tasks.
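To make the encode-and-reconstruct step concrete, the following is a minimal NumPy sketch of orthogonal matching pursuit followed by the reconstruction $\hat{\bm{K}} = {\bm{K}}_{\text{csr}} {\bm{D}}_k^T$. It is an illustration under stated assumptions, not the authors' implementation: the dictionary here is random rather than learned, the codes are held in a dense array instead of CSR, the sizes are shrunk for speed, and the helper names (`random_dictionary`, `omp`, `encode`) are hypothetical.

```python
import numpy as np


def random_dictionary(m, n_atoms, seed=0):
    """Stand-in for a learned dictionary: m x n_atoms with unit-norm columns (assumption)."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((m, n_atoms))
    return D / np.linalg.norm(D, axis=0, keepdims=True)


def omp(D, x, s):
    """Greedy OMP: approximate x using at most s columns (atoms) of D.

    Returns the selected atom indices and their least-squares coefficients.
    """
    residual = x.copy()
    support = []
    coeffs = np.zeros(0)
    for _ in range(s):
        # Select the atom most correlated with the current residual.
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j in support:
            break  # residual is already orthogonal to every useful atom
        support.append(j)
        # Refit all selected coefficients jointly (the "orthogonal" step).
        coeffs, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coeffs
    return np.array(support, dtype=int), coeffs


def encode(D, X, s):
    """Sparse-code each row of X (T x m); returns a T x n_atoms code matrix."""
    codes = np.zeros((X.shape[0], D.shape[1]))
    for t, x in enumerate(X):
        idx, c = omp(D, x, s)
        codes[t, idx] = c
    return codes


if __name__ == "__main__":
    m, n_atoms, T = 64, 256, 16  # toy sizes, far smaller than N = 4096
    D_k = random_dictionary(m, n_atoms)
    K = np.random.default_rng(1).standard_normal((T, m))
    for s in (2, 8):
        K_hat = encode(D_k, K, s) @ D_k.T  # K_hat = K_csr D_k^T
        rel_err = np.linalg.norm(K - K_hat) / np.linalg.norm(K)
        print(f"s={s}: relative reconstruction error {rel_err:.3f}")
```

In this sketch the sparsity `s` is the only knob: a larger `s` stores more coefficients per token and lowers the reconstruction error. A practical version would keep the codes in a sparse format and leave the most recent tokens uncompressed, as described above.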
Abstract
We introduce Lexico, a novel KV cache compression method that leverages sparse coding with a universal dictionary. Our key finding is that the key-value (KV) cache in modern LLMs can be accurately approximated using sparse linear combinations of atoms from a small, input-agnostic dictionary of ~4k atoms, enabling efficient compression across different input prompts, tasks, and models. Using orthogonal matching pursuit for sparse approximation, Lexico achieves flexible compression ratios through direct sparsity control. On GSM8K, across multiple model families (Mistral, Llama 3, Qwen2.5), Lexico maintains 90–95% of the original performance while using only 15–25% of the full KV cache memory, outperforming both quantization and token eviction methods. Notably, Lexico remains effective in low-memory regimes where 2-bit quantization fails, achieving up to 1.7× better compression on LongBench and GSM8K while maintaining high accuracy.
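As a rough illustration of what "direct sparsity control" means for memory, the sketch below estimates the per-vector footprint of a sparse code relative to a dense FP16 vector. The byte widths (1 byte per coefficient, 2 bytes per atom index, 2 bytes per dense element) are assumptions chosen for illustration and are not taken from this text; the estimate also ignores the dictionary itself, sparse-format row offsets, and the full-precision buffer for recent tokens. The function name `kv_memory_ratio` is hypothetical.

```python
def kv_memory_ratio(s, head_dim, value_bytes=1, index_bytes=2, dense_bytes=2):
    """Approximate size of one sparse-coded vector relative to its dense form.

    s: number of nonzero coefficients kept per vector.
    head_dim: dimension of the original key or value vector.
    The byte widths are illustrative assumptions, not measured values.
    """
    sparse_bytes = s * (value_bytes + index_bytes)  # coefficient + atom index per nonzero
    dense_total = head_dim * dense_bytes            # uncompressed vector
    return sparse_bytes / dense_total


if __name__ == "__main__":
    for s in (4, 8, 16):
        print(f"s={s}: {kv_memory_ratio(s, head_dim=128):.2%} of dense size")
```

Under these assumptions the ratio grows linearly in `s`, so the compression level can be set by choosing the sparsity rather than by switching between fixed bit widths.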
