WildCat: Near-Linear Attention in Theory and Practice

Tobias Schröder, Lester Mackey

TL;DR

WildCat tackles the quadratic cost of attention by attending over a small, optimally weighted coreset of keys and values, selected with randomly pivoted Cholesky and weighted via a Nyström approximation. The method yields a near-linear-time, spectrally accurate approximation of attention, with a theoretical guarantee of super-polynomial error decay for bounded inputs, and it scales to growing sequence length and dimensionality. Empirically, WildCat delivers substantial speedups and memory savings while preserving or improving performance on image generation, image classification, and long-context KV-cache tasks, aided by a GPU-optimized implementation. This approach narrows the theory-practice gap in attention approximation and enables efficient deployment of transformer-based models in resource-constrained settings.

Abstract

We introduce WildCat, a high-accuracy, low-cost approach to compressing the attention mechanism in neural networks. While attention is a staple of modern network architectures, it is also notoriously expensive to deploy due to resource requirements that scale quadratically with the input sequence length $n$. WildCat avoids these quadratic costs by only attending over a small weighted coreset. Crucially, we select the coreset using a fast but spectrally-accurate subsampling algorithm -- randomly pivoted Cholesky -- and weight the elements optimally to minimise reconstruction error. Remarkably, given bounded inputs, WildCat approximates exact attention with super-polynomial $O(n^{-\sqrt{\log(\log(n))}})$ error decay while running in near-linear $O(n^{1+o(1)})$ time. In contrast, prior practical approximations either lack error guarantees or require quadratic runtime to guarantee such high fidelity. We couple this advance with a GPU-optimized PyTorch implementation and a suite of benchmark experiments demonstrating the benefits of WildCat for image generation, image classification, and language model KV cache compression.
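
To make the coreset-selection step concrete, here is a minimal NumPy sketch of generic randomly pivoted Cholesky applied to a key kernel matrix. It is an illustration under stated assumptions, not the authors' GPU implementation: the names `rp_cholesky`, `get_column`, and `diag` are hypothetical, and the exponential kernel $\exp(\mathbf{k}_i^\top \mathbf{k}_j)$ is assumed without any temperature or $1/\sqrt{d}$ scaling. Each pivot is sampled with probability proportional to the current residual diagonal, and only the sampled kernel columns are ever evaluated.

```python
import numpy as np

def rp_cholesky(get_column, diag, r, rng):
    """Randomly pivoted Cholesky (generic sketch, not the paper's code).

    get_column(s) returns column s of a PSD kernel matrix G,
    diag holds the diagonal of G, and r is the number of pivots.
    Returns F (n x r) with F @ F.T approximating G, and the pivot list.
    """
    n = diag.shape[0]
    F = np.zeros((n, r))
    d = diag.astype(float).copy()          # residual diagonal
    pivots = []
    for i in range(r):
        total = d.sum()
        if total <= 0.0:                   # residual exhausted: stop early
            break
        # Sample a pivot proportionally to the residual diagonal.
        s = int(rng.choice(n, p=d / total))
        # Residual column: kernel column minus what F already explains.
        g = get_column(s) - F[:, :i] @ F[s, :i]
        F[:, i] = g / np.sqrt(g[s])
        # Update the residual diagonal and retire the chosen pivot.
        d = np.maximum(d - F[:, i] ** 2, 0.0)
        d[s] = 0.0
        pivots.append(s)
    return F[:, :len(pivots)], pivots

# Hypothetical usage: select a 10-point coreset from 40 random keys.
rng = np.random.default_rng(0)
keys = 0.5 * rng.normal(size=(40, 4))
column = lambda s: np.exp(keys @ keys[s])      # one kernel column, O(n d)
diagonal = np.exp((keys ** 2).sum(axis=1))     # kernel diagonal
F, coreset = rp_cholesky(column, diagonal, 10, rng)
```

In this sketch the loop costs $O(nrd + nr^2)$ arithmetic for $r$ pivots, because it evaluates $r$ kernel columns and never forms the full $n \times n$ matrix.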

Paper Structure

This paper contains 44 sections, 27 theorems, 158 equations, 5 figures, 4 tables.

Key Result

Lemma 1

Let $\widehat{\mathbf{A}}$ be an approximation to $\mathbf{A}$, $\mathbf{\widehat{D}} = \mathrm{diag}(\widehat{\mathbf{A}} \mathbf{1}_n)$, and $\widehat{\mathbf O} \triangleq \mathrm{clip}(\mathbf{\widehat{D}}^{-1}\widehat{\mathbf{A}} \mathbf{V}, \mathbf{v}_\mathrm{min}, \mathbf{v}_\mathrm{max})$ fo

Figures (5)

  • Figure 1: Visualisation of the WildCat methodology. Our goal is the approximation of the off-diagonal block $\mathbf{A}$ through a Nyström approximation $\widehat{\mathbf{A}}_\tau \triangleq h(\mathbf{Q}, \mathbf{K}_{\mathcal{S}})\, h\left(\frac{1}{\tau}\mathbf{K}_{\mathcal{S}}, \frac{1}{\tau}\mathbf{K}_{\mathcal{S}}\right)^{-1} h\left(\frac{1}{\tau}\mathbf{K}_{\mathcal{S}}, \frac{1}{\tau}\mathbf{K}\right)$. With the right order of operations, the computation cost for $\mathbf{A}\mathbf{V}$ decreases from $O(mnd)$ to $O(rnd + mrd + nr^2)$. For the exponential kernel, the off-diagonal block is invariant under $\mathbf{Q} \to \tau\mathbf{Q}$, $\mathbf{K} \to \frac{1}{\tau}\mathbf{K}$. Since we only select coreset points from $\mathbf{K}$, we can optimise for low-rank approximability.
  • Figure 2: CompressKV
  • Figure 3: WtdAttn
  • Figure 4: WildCat
  • Figure 5: Example generations from BigGAN with exact or approximate attention.
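
The order of operations mentioned in the Figure 1 caption can be sketched as follows. This is a hedged NumPy illustration rather than the paper's implementation. It assumes $h(\mathbf{X}, \mathbf{Y}) = \exp(\mathbf{X}\mathbf{Y}^\top)$, takes the coreset indices `S` as given, and treats $\mathbf{v}_\mathrm{min}$ and $\mathbf{v}_\mathrm{max}$ from Lemma 1 as the columnwise minimum and maximum of $\mathbf{V}$; that reading is an assumption, since the lemma statement is truncated here. The function names are hypothetical.

```python
import numpy as np

def exp_kernel(X, Y):
    """Assumed exponential kernel h(X, Y) = exp(X Y^T) over all row pairs."""
    return np.exp(X @ Y.T)

def nystrom_attention(Q, K, V, S, tau=1.0):
    """Approximate attention output using coreset indices S (sketch).

    Multiplies right-to-left so that no m x n matrix is ever formed.
    """
    K_S = K[S]
    # Append a ones column so A_hat @ V and A_hat @ 1 share one pass.
    V1 = np.hstack([V, np.ones((V.shape[0], 1))])
    right = exp_kernel(K_S / tau, K / tau) @ V1     # r x (d_v + 1)
    core = exp_kernel(K_S / tau, K_S / tau)         # r x r
    weighted = np.linalg.solve(core, right)         # Nyström weights applied
    out = exp_kernel(Q, K_S) @ weighted             # m x (d_v + 1)
    numer, denom = out[:, :-1], out[:, -1:]
    # Row-normalise, then clip to the range of V (assumed v_min, v_max).
    return np.clip(numer / denom, V.min(axis=0), V.max(axis=0))

# Hypothetical usage with a 3-point coreset out of 6 keys.
rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 4))
K = rng.normal(size=(6, 4))
V = rng.normal(size=(6, 3))
O_hat = nystrom_attention(Q, K, V, S=np.array([0, 2, 5]), tau=1.0)
```

When `S` contains every key index, the middle factors cancel and the sketch reproduces exact softmax attention, which is a convenient sanity check for any implementation along these lines.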

Theorems & Definitions (49)

  • Lemma 1: Approximate attention guarantee
  • Lemma 2: Nyström guarantee
  • Theorem 1: RPNys guarantee
  • Lemma 3: Taylor guarantee
  • Lemma 4: Taylor rank bound
  • Theorem 2: WildCat guarantee
  • Corollary 1: Super-polynomial error decay in near-linear time
  • proof
  • Corollary 2: Refined super-polynomial error decay in near-linear time
  • Lemma 4.1: Iterated expected residual bound
  • ...and 39 more