Solving Attention Kernel Regression Problem via Pre-conditioner
Zhao Song, Junze Yin, Lichen Zhang
TL;DR
The paper tackles the computational bottleneck of attention in large models by introducing two proxy problems for the attention matrix: a matrix-exponential proxy built from powers of $A^\top A$, and an attention-kernel proxy given by the entrywise exponential $\exp(AA^\top)$. It develops fast, randomized algorithms based on sketching and preconditioning to solve regression problems against these proxies, targeting near-linear time in $n$ and $d$ in the regime $n \gg d$, with provable error bounds and high-probability guarantees. The core contributions are (i) efficient algorithms for regressing against $(A^\top A)^j$ and $A(A^\top A)^j$ with explicit time bounds, (ii) a method to approximate $\exp(AA^\top)$ via spectral-approximation sketches, together with a preconditioner that enables fast gradient-based solves, and (iii) an analysis relating these subproblems to the term-by-term expansion of the matrix exponential, offering an alternative perspective on efficient attention approximation. These results could help speed up attention-related computations in transformers and other attention-based architectures, especially in large-scale or structured-data settings.
Abstract
The attention mechanism is the key to large language models, and the attention matrix serves as an algorithmic and computational bottleneck for such a scheme. In this paper, we define two problems, motivated by designing fast algorithms for proxies of the attention matrix and solving regressions against them. Given an input matrix $A\in \mathbb{R}^{n\times d}$ with $n\gg d$ and a response vector $b$, we first consider the matrix exponential of the matrix $A^\top A$ as a proxy, and we in turn design algorithms for two types of regression problems: $\min_{x\in \mathbb{R}^d}\|(A^\top A)^jx-b\|_2$ and $\min_{x\in \mathbb{R}^d}\|A(A^\top A)^jx-b\|_2$ for any positive integer $j$. Studying algorithms for these regressions is essential, as the matrix exponential can be approximated term by term via these smaller problems. The second proxy applies the exponential entrywise to the Gram matrix, denoted by $\exp(AA^\top)$, and we solve the regression $\min_{x\in \mathbb{R}^n}\|\exp(AA^\top)x-b \|_2$. We call this problem the attention kernel regression problem, as the matrix $\exp(AA^\top)$ can be viewed as a kernel function with respect to $A$. We design fast algorithms for these regression problems, based on sketching and preconditioning. We hope these efforts will provide an alternative perspective on studying efficient approximation of attention matrices.
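
To make the phrase "sketching and preconditioning" concrete, the following is a minimal illustrative sketch of the general pattern for the simplest related problem, $\min_{x}\|Ax-b\|_2$ with $n \gg d$. It is not the algorithm of the paper. The dense Gaussian sketch, the sketch size of $20d$ rows, the step size of $0.5$, the iteration count, and the function name `sketch_precondition_lstsq` are all assumptions made for this example.

```python
import numpy as np


def sketch_precondition_lstsq(A, b, sketch_rows=None, iters=100, step=0.5, seed=0):
    """Approximately solve min_x ||A x - b||_2 for a tall matrix A (n >> d).

    Illustrative only: the sketch type, sketch size, and step size are
    assumptions of this example, not parameters taken from the paper.
    """
    n, d = A.shape
    rng = np.random.default_rng(seed)
    m = sketch_rows if sketch_rows is not None else 20 * d

    # Dense Gaussian sketch S (m x n), chosen for clarity rather than speed.
    S = rng.standard_normal((m, n)) / np.sqrt(m)

    # QR of the small sketched matrix S A. Since R^T R = (S A)^T (S A)
    # approximates A^T A, the matrix A R^{-1} is well conditioned.
    _, R = np.linalg.qr(S @ A)

    x = np.zeros(d)
    for _ in range(iters):
        # Gradient of 0.5 * ||A x - b||^2 is A^T (A x - b).
        grad = A.T @ (A @ x - b)
        # Preconditioned, damped gradient step: x <- x - step * (R^T R)^{-1} grad.
        x = x - step * np.linalg.solve(R, np.linalg.solve(R.T, grad))
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((2000, 5))
    b = rng.standard_normal(2000)
    x = sketch_precondition_lstsq(A, b)
    x_ref = np.linalg.lstsq(A, b, rcond=None)[0]
    print(np.linalg.norm(x - x_ref))
```

Each iteration costs two matrix-vector products with $A$ plus two $d \times d$ solves, while the QR factorization is computed only on the $m \times d$ sketched matrix. A dense Gaussian sketch takes $O(mnd)$ time to apply, so a practical implementation would substitute a faster sketching matrix.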
