Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention For Test-Time Regression

Yifei Zuo, Yutong Yin, Zhichen Zeng, Ang Li, Banghua Zhu, Zhaoran Wang

TL;DR

This work introduces Local Linear Attention (LLA), a regression-based attention mechanism that interpolates between Linear Attention and Softmax Attention within a test-time regression framework. The authors provide a bias-variance analysis showing LLA can achieve favorable associative recall and non-stationary adaptation, and they develop FlashLLA, a memory-efficient blockwise algorithm with a conjugate-gradient-based matrix-free inversion to tackle the $\Theta(n^2 d)$ and $\Theta(n d^2)$ memory costs. The approach is validated through synthetic test-time regression, in-context regression, associative recall, and state-tracking tasks, demonstrating robust performance and scalability potential for long-context and large models. The work also outlines practical implementation details, including memory primitives, a blockwise forward pass, and kernel development considerations for deployment on modern accelerators.

Abstract

Transformer architectures have achieved remarkable success in various domains. While efficient alternatives to Softmax Attention have been widely studied, the search for more expressive mechanisms grounded in theoretical insight, even at greater computational cost, has been relatively underexplored. In this work, we bridge this gap by proposing Local Linear Attention (LLA), a novel attention mechanism derived from nonparametric statistics through the lens of test-time regression. First, we show that LLA offers theoretical advantages over Linear and Softmax Attention for associative memory via a bias-variance trade-off analysis. Next, we address its computational challenges and propose two memory-efficient primitives to tackle the $\Theta(n^2 d)$ and $\Theta(n d^2)$ complexity. We then introduce FlashLLA, a hardware-efficient, blockwise algorithm that enables scalable and parallel computation on modern accelerators. In addition, we implement and profile a customized inference kernel that significantly reduces memory overheads. Finally, we empirically validate the advantages and limitations of LLA on test-time regression, in-context regression, associative recall, and state-tracking tasks. Experimental results demonstrate that LLA effectively adapts to non-stationarity, outperforming strong baselines in test-time training and in-context learning, and exhibiting promising evidence for its scalability and applicability in large-scale models. Code is available at https://github.com/Yifei-Zuo/Flash-LLA.
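The matrix-free inversion mentioned in the TL;DR can be illustrated with a small conjugate-gradient solver. This is a minimal sketch, not the FlashLLA kernel: it assumes the system to solve has the form $(K^\top \mathrm{diag}(w) K + \lambda I)\,x = b$ for keys $K \in \mathbb{R}^{n\times d}$, nonnegative weights $w$, and a ridge term $\lambda$, none of which are specified in this summary. The names `conjugate_gradient`, `gram_matvec`, `w`, and `lam` are hypothetical. The point is that the solver only needs products $v \mapsto Av$, so the $d\times d$ matrix is never formed.

```python
import numpy as np

def conjugate_gradient(matvec, b, iters=50, tol=1e-10):
    """Solve A x = b for symmetric positive-definite A, given only v -> A v."""
    x = np.zeros_like(b)
    r = b - matvec(x)          # initial residual
    p = r.copy()               # initial search direction
    rs = r @ r
    for _ in range(iters):
        if np.sqrt(rs) < tol:  # stop once the residual norm is small
            break
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Toy problem (all sizes and values are assumptions for illustration).
rng = np.random.default_rng(0)
n, d, lam = 64, 8, 1e-2
K = rng.standard_normal((n, d))
w = rng.uniform(size=n)
w = w / w.sum()
b = rng.standard_normal(d)

def gram_matvec(v):
    # (K^T diag(w) K + lam I) v using two matrix-vector products.
    return K.T @ (w * (K @ v)) + lam * v

x_cg = conjugate_gradient(gram_matvec, b)

# Dense reference, used here only to check the matrix-free result.
A_dense = K.T @ (w[:, None] * K) + lam * np.eye(d)
x_dense = np.linalg.solve(A_dense, b)
```

Each matrix-vector product touches only $K$ and length-$n$ or length-$d$ vectors, which is the kind of saving a matrix-free approach is after; how the paper organizes this per query and per block is not covered by this sketch.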

Paper Structure

This paper contains 50 sections, 12 theorems, 110 equations, 8 figures, 2 algorithms.

Key Result

Proposition 2.1

Let $(X_i,Y_i)_{i=1}^n$ be i.i.d., $X_i\in\mathbb{R}^d$ supported on a bounded set $D\subset\mathbb{R}^d$, and $Y_i=f(X_i)+\varepsilon_i\in\mathbb{R}^{d_y}$ with $\mathbb{E}[\varepsilon_i\mid X_i]=0$ and $\mathbb{E}[\varepsilon_i^2\mid X_i]=\sigma^2(X_i)$. Let $\widehat{f}_{\mathrm{GL}}$ denote a gl

Figures (8)

  • Figure 1: A comparison of regression strategies: global linear models (e.g., SSMs, MesaNet, DeltaNet) employ query agnostic linear fits and suffer from irreducible approximation error due to model misspecification; local constant models (e.g., Softmax Attention) perform query-specific local averaging but exhibit boundary bias; local linear models (e.g., LLA) achieve a superior bias-variance trade-off combining locality and linear fitting.
  • Figure 2: FlashLLA reduces the working set memory to $\Theta(nd)$. The figure shows the profiling result for $d=128$; OOM points are omitted.
  • Figure 3: Test-time regression performance on a piecewise-linear task. The figures demonstrate position-wise MSE for $d=64$ with $S\in\{64,256,512,1024\}$. Results are averaged over $10{,}000$ independently sampled sequences; LLA outperforms other baselines and benefits from more in-segment data; MesaNet excels only before the first shift. The y-axis uses a logarithmic scale.
  • Figure 4: The advantage of LLA scales with the data dimension. Axes are in logarithmic scale.
  • Figure 5: In-context regression and in-context associative recall results, shown for models with $d=128$ and $2$ attention heads. Each point represents the best performance achieved across training hyperparameters, averaged over $3$ random seeds. LLA consistently outperforms other baselines in in-context regression, as in the test-time regression task, and achieves the highest accuracy in associative recall across different sequence lengths and numbers of key-value pairs.
  • ...and 3 more figures
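
The contrast drawn in Figure 1 between local constant and local linear estimators can be reproduced on a toy problem. The following is a minimal sketch under stated assumptions, not the paper's LLA layer: it assumes kernel weights proportional to $\exp(\text{scale}\cdot\langle q, k_i\rangle)$ and a small ridge term for numerical stability, and the names `softmax_weights`, `local_constant`, `local_linear`, `scale`, and `ridge` are hypothetical. The data follow a noise-free affine map, and the query sits at a corner of the key support, where a weighted average is expected to show boundary bias while a locally fitted affine model is not.

```python
import numpy as np

def softmax_weights(q, K, scale=1.0):
    """Weights w_i proportional to exp(scale * <q, k_i>), normalized to sum to 1."""
    logits = scale * (K @ q)
    logits = logits - logits.max()   # stabilize the exponentials
    w = np.exp(logits)
    return w / w.sum()

def local_constant(q, K, V, scale=1.0):
    """Local constant fit: a weighted average of the values."""
    return softmax_weights(q, K, scale) @ V

def local_linear(q, K, V, scale=1.0, ridge=1e-8):
    """Weighted least-squares affine fit around q, evaluated at q."""
    w = softmax_weights(q, K, scale)
    A = np.hstack([np.ones((len(K), 1)), K - q])    # design matrix [1, k_i - q]
    G = A.T @ (w[:, None] * A) + ridge * np.eye(A.shape[1])
    B = A.T @ (w[:, None] * V)
    coef = np.linalg.solve(G, B)
    return coef[0]                   # intercept row equals the fitted value at q

# Toy data (all sizes and values are assumptions): values are an affine map of keys.
rng = np.random.default_rng(0)
n, d, dy = 256, 4, 2
K = rng.uniform(0.0, 1.0, size=(n, d))
W = np.arange(1, d * dy + 1, dtype=float).reshape(d, dy) / 10.0
b = np.array([0.5, -0.5])
V = K @ W + b

q = np.ones(d)                       # query at a corner of the key support
target = q @ W + b
err_lc = np.linalg.norm(local_constant(q, K, V, scale=2.0) - target)
err_ll = np.linalg.norm(local_linear(q, K, V, scale=2.0) - target)
print(err_lc, err_ll)
```

Because every key lies on one side of the query, the weighted average is pulled toward the interior of the support, whereas the affine fit extrapolates the local trend back to the query. With noisy values the local linear fit would also carry variance, which is the trade-off the paper's analysis addresses.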

Theorems & Definitions (27)

  • Proposition 2.1
  • Proposition 2.2
  • Lemma A.1
  • proof
  • Definition A.1: Exact kernel domain and boundary layer.
  • Definition A.2: Exact kernel moments
  • Lemma A.2
  • proof
  • Definition A.3: Uniform Boundary Layer
  • Lemma A.3
  • ...and 17 more