Higher-order Linear Attention
Yifan Zhang, Zhen Qin, Quanquan Gu
TL;DR
This work addresses the quadratic $O(n^2)$ cost of standard scaled dot-product attention by introducing Higher-order Linear Attention (HLA), a causal, streaming mechanism that realizes higher-order interactions through compact prefix statistics. At second order, HLA maintains a constant-size state per head and computes each token's output in $O(d^2 + d d_v)$ time without forming any $n \times n$ matrix, and it enforces strict autoregressive masking through additional summary statistics. It also supports chunk-parallel training with associative scans that reproduce the activations of the serial recurrence exactly, and the framework extends to an asymmetric variant (AHLA) and to third-order HLA, which broadens the expressivity of attention-like mixers while preserving streaming efficiency. HLA thereby combines attention-style, data-dependent mixing with the efficiency and parallelism of recurrent architectures, which makes it a candidate building block for long-context sequence models.
Abstract
The quadratic cost of scaled dot-product attention is a central obstacle to scaling autoregressive language models to long contexts. Linear-time attention and State Space Models (SSMs) provide scalable alternatives but are typically restricted to first-order or kernel-based approximations, which can limit expressivity. We introduce Higher-order Linear Attention (HLA), a causal, streaming mechanism that realizes higher-order interactions via compact prefix sufficient statistics. In the second-order case, HLA maintains a constant-size state and computes per-token outputs in linear time without materializing any $n \times n$ matrices. We give closed-form streaming identities, a strictly causal masked variant using two additional summaries, and a chunk-parallel training scheme based on associative scans that reproduces the activations of a serial recurrence exactly. We further outline extensions to third and higher orders. Collectively, these results position HLA as a principled, scalable building block that combines attention-like, data-dependent mixing with the efficiency of modern recurrent architectures. Project Page: https://github.com/yifanzhang-pro/HLA.
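To make the idea of a constant-size second-order state concrete, the following is a minimal sketch rather than the authors' implementation. It assumes, as an illustration not spelled out in the text above, that the unnormalized second-order output has the form $o_t^\top = q_t^\top S_t C_t$ with prefix statistics $S_t = \sum_{i \le t} k_i k_i^\top$ and $C_t = \sum_{j \le t} q_j v_j^\top$. It omits the strictly causal masked variant with its two additional summaries, as well as any normalization. The function names and the pure-Python list representation are hypothetical choices made for this example. The sketch checks the streaming recurrence against a brute-force sum over all prefix pairs.

```python
import random

def outer(a, b):
    """Outer product a b^T as a list of rows."""
    return [[x * y for y in b] for x in a]

def add_inplace(M, N):
    """M += N, elementwise, for matrices stored as lists of rows."""
    for row_m, row_n in zip(M, N):
        for j, val in enumerate(row_n):
            row_m[j] += val

def vec_mat(x, M):
    """Row vector x (length r) times matrix M (r x c)."""
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def streaming_second_order(Q, K, V):
    """Assumed second-order form: o_t = q_t^T S_t C_t, updated token by token."""
    d, d_v = len(Q[0]), len(V[0])
    S = [[0.0] * d for _ in range(d)]    # d x d state: sum of k_i k_i^T
    C = [[0.0] * d_v for _ in range(d)]  # d x d_v state: sum of q_j v_j^T
    outputs = []
    for q, k, v in zip(Q, K, V):
        add_inplace(S, outer(k, k))      # O(d^2) update
        add_inplace(C, outer(q, v))      # O(d d_v) update
        u = vec_mat(q, S)                # O(d^2)
        outputs.append(vec_mat(u, C))    # O(d d_v)
    return outputs

def brute_force_reference(Q, K, V):
    """Same quantity from explicit sums over all prefix pairs (i, j <= t)."""
    d_v = len(V[0])
    outputs = []
    for t in range(len(Q)):
        o = [0.0] * d_v
        for j in range(t + 1):
            w = sum(dot(Q[t], K[i]) * dot(K[i], Q[j]) for i in range(t + 1))
            for c in range(d_v):
                o[c] += w * V[j][c]
        outputs.append(o)
    return outputs

def demo(n=6, d=3, d_v=2, seed=0):
    """Largest absolute gap between the streaming and brute-force outputs."""
    rng = random.Random(seed)
    rand = lambda rows, cols: [[rng.uniform(-1, 1) for _ in range(cols)]
                               for _ in range(rows)]
    Q, K, V = rand(n, d), rand(n, d), rand(n, d_v)
    fast = streaming_second_order(Q, K, V)
    slow = brute_force_reference(Q, K, V)
    return max(abs(a - b) for fr, sr in zip(fast, slow) for a, b in zip(fr, sr))
```

Under these assumptions, `demo()` returns the largest discrepancy between the two computations, which should sit at floating-point rounding level. The state held by `streaming_second_order` has size $d^2 + d\,d_v$ regardless of sequence length, which matches the per-token cost quoted in the TL;DR. Because both statistics are plain prefix sums, they combine associatively across chunks, the property an associative scan relies on.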
