Higher-order Linear Attention

Yifan Zhang, Zhen Qin, Quanquan Gu

TL;DR

This work tackles the quadratic $O(n^2)$ cost of standard scaled dot-product attention by introducing Higher-order Linear Attention (HLA), a causal, streaming mechanism that achieves higher-order interactions through compact prefix statistics. At second order, HLA maintains a constant-size state per head and computes per-token outputs in $O(d^2 + d d_v)$ time without forming $n\times n$ matrices, while enforcing strict autoregressive masking via extended summaries. It then enables chunk-parallel training with associative scans that exactly reproduce serial recurrence activations, and extends the framework to asymmetric (AHLA) and third-order HLA, broadening the expressivity of attention-like mixers while preserving streaming efficiency. The approach offers a principled integration of attention-style, data-dependent mixing with the efficiency and parallelism of recurrent architectures, making it suitable for long-context modeling in practical transformers and related architectures. Overall, HLA provides a rigorous, scalable path to higher-order interactions in sequence models without sacrificing streaming updates or exact training equivalence.

Abstract

The quadratic cost of scaled dot-product attention is a central obstacle to scaling autoregressive language models to long contexts. Linear-time attention and State Space Models (SSMs) provide scalable alternatives but are typically restricted to first-order or kernel-based approximations, which can limit expressivity. We introduce Higher-order Linear Attention (HLA), a causal, streaming mechanism that realizes higher interactions via compact prefix sufficient statistics. In the second-order case, HLA maintains a constant-size state and computes per-token outputs in linear time without materializing any $n \times n$ matrices. We give closed-form streaming identities, a strictly causal masked variant using two additional summaries, and a chunk-parallel training scheme based on associative scans that reproduces the activations of a serial recurrence exactly. We further outline extensions to third and higher orders. Collectively, these results position HLA as a principled, scalable building block that combines attention-like, data-dependent mixing with the efficiency of modern recurrent architectures. Project Page: https://github.com/yifanzhang-pro/HLA.

Paper Structure

This paper contains 39 sections, 4 theorems, 46 equations, 3 algorithms.

Key Result

Theorem 3.1

For each $t$, let […]. Consequently, the strictly causal, masked default unnormalized output is […]. An optional linear normalization divides by the masked denominator […], where $\varepsilon > 0$ is a small constant added for numerical stability.
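As a rough illustration of the kind of identity Theorem 3.1 describes, the sketch below streams a strictly causal second-order output and checks it against a dense masked reference. It assumes the prefix summaries $S_t = \sum_{i \le t} k_i k_i^\top$ and $C_t = \sum_{i \le t} q_i v_i^\top$, plus one extra summary $G_t = \sum_{i \le t} k_i k_i^\top C_{i-1}$, so that the masked output is $q_t^\top (S_t C_t - G_t)$. These definitions, the function names, and the dense reference $((L \odot QK^\top)(L \odot QK^\top)^\top \odot L)\,V$ are assumptions made for illustration and are not quoted from the paper; the normalization term is omitted.

```python
import numpy as np

def hla2_masked_streaming(Q, K, V):
    """Stream a masked second-order output with a constant-size state (assumed form)."""
    n, d = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d, d))      # running sum of k_i k_i^T
    C = np.zeros((d, d_v))    # running sum of q_i v_i^T
    G = np.zeros((d, d_v))    # running sum of k_i k_i^T C_{i-1} (masking correction)
    out = np.zeros((n, d_v))
    for t in range(n):
        q, k, v = Q[t], K[t], V[t]
        G += np.outer(k, k @ C)   # must use C_{t-1}, so update G before C
        S += np.outer(k, k)
        C += np.outer(q, v)
        # (q @ S) @ C costs O(d^2 + d*d_v); no n x n matrix is formed
        out[t] = (q @ S) @ C - q @ G
    return out

def hla2_masked_dense(Q, K, V):
    """Quadratic reference that materializes the n x n matrices, for checking only."""
    n = Q.shape[0]
    L = np.tril(np.ones((n, n)))      # causal mask, including the diagonal
    M = L * (Q @ K.T)                 # masked first-order scores
    W = (M @ M.T) * L                 # masked second-order weights
    return W @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((16, 4))
K = rng.standard_normal((16, 4))
V = rng.standard_normal((16, 3))
print(np.allclose(hla2_masked_streaming(Q, K, V), hla2_masked_dense(Q, K, V)))
```

The state `(S, C, G)` has size $d^2 + 2 d d_v$ regardless of sequence length, which is the property the TL;DR refers to as a constant-size state per head.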

Theorems & Definitions (5)

  • Theorem 3.1: Masked streaming identity for second order
  • Theorem 4.1: Scan equivalence: serial vs. (decayed) associative scans
  • Remark 4.2: Inclusive vs. exclusive scans
  • Theorem 6.1: Masked streaming identity for AHLA
  • Theorem 7.1: Masked streaming identity for third order
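The scan equivalence named in Theorem 4.1 can be illustrated with a small sketch for the undecayed second-order case. It assumes a per-segment summary `(S, C, G)` with $S = \sum k_i k_i^\top$, $C = \sum q_i v_i^\top$, $G = \sum k_i k_i^\top C_{i-1}$, and a hypothetical `combine` operator that concatenates two adjacent segments via $G_{AB} = G_A + G_B + S_B C_A$. The operator and all names here are assumptions derived for this example, not the paper's stated construction; decay and normalization are left out.

```python
import numpy as np

def segment_summary(Q, K, V):
    """Serial recurrence over one segment, returning its (S, C, G) summary."""
    d, d_v = Q.shape[1], V.shape[1]
    S = np.zeros((d, d))
    C = np.zeros((d, d_v))
    G = np.zeros((d, d_v))
    for q, k, v in zip(Q, K, V):
        G += np.outer(k, k @ C)   # uses C from before this token
        S += np.outer(k, k)
        C += np.outer(q, v)
    return S, C, G

def combine(a, b):
    """Summary of segment a followed by segment b (assumed associative operator)."""
    Sa, Ca, Ga = a
    Sb, Cb, Gb = b
    # Tokens in b also see all of a's C, which contributes the cross term Sb @ Ca.
    return Sa + Sb, Ca + Cb, Ga + Gb + Sb @ Ca

rng = np.random.default_rng(0)
Q = rng.standard_normal((12, 4))
K = rng.standard_normal((12, 4))
V = rng.standard_normal((12, 3))

# Summarize three chunks of four tokens independently, then merge them.
chunks = [segment_summary(Q[i:i + 4], K[i:i + 4], V[i:i + 4]) for i in (0, 4, 8)]
left = combine(combine(chunks[0], chunks[1]), chunks[2])
right = combine(chunks[0], combine(chunks[1], chunks[2]))
serial = segment_summary(Q, K, V)

print(all(np.allclose(x, y) for x, y in zip(left, serial)))
print(all(np.allclose(x, y) for x, y in zip(left, right)))
```

Because the grouping of `combine` calls does not change the result, the per-chunk summaries can be merged with a parallel prefix scan while matching the serial recurrence up to floating-point rounding.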