Higher-order Linear Attention

Yifan Zhang; Zhen Qin; Quanquan Gu

TL;DR

This work addresses the quadratic $O(n^2)$ cost of standard scaled dot-product attention by introducing Higher-order Linear Attention (HLA), a causal, streaming mechanism that realizes higher-order interactions through compact prefix statistics. At second order, HLA maintains a constant-size state per head and computes each token's output in $O(d^2 + d d_v)$ time without forming any $n \times n$ matrix; strict autoregressive masking is enforced via extended summaries. Training is chunk-parallel: associative scans reproduce the activations of the serial recurrence exactly. The framework extends to an asymmetric variant (AHLA) and to third-order HLA, broadening the expressivity of attention-like mixers while preserving streaming efficiency. HLA thus combines attention-style, data-dependent mixing with the efficiency and parallelism of recurrent architectures, which makes it a candidate for long-context modeling in transformers and related architectures.

Abstract

The quadratic cost of scaled dot-product attention is a central obstacle to scaling autoregressive language models to long contexts. Linear-time attention and State Space Models (SSMs) provide scalable alternatives but are typically restricted to first-order or kernel-based approximations, which can limit expressivity. We introduce Higher-order Linear Attention (HLA), a causal, streaming mechanism that realizes higher-order interactions via compact prefix sufficient statistics. In the second-order case, HLA maintains a constant-size state and computes per-token outputs in linear time without materializing any $n \times n$ matrices. We give closed-form streaming identities, a strictly causal masked variant using two additional summaries, and a chunk-parallel training scheme based on associative scans that reproduces the activations of a serial recurrence exactly. We further outline extensions to third and higher orders. Collectively, these results position HLA as a principled, scalable building block that combines attention-like, data-dependent mixing with the efficiency of modern recurrent architectures. Project Page: https://github.com/yifanzhang-pro/HLA.
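To make the idea of constant-size prefix statistics concrete, the following is a minimal sketch, not the paper's implementation. It assumes one illustrative second-order interaction, $o_t = \sum_{i \le t} \sum_{j \le t} (q_t \cdot k_i)(k_i \cdot q_j)\, v_j$, which factors as $q_t^\top S_t C_t$ with $S_t = \sum_{i \le t} k_i k_i^\top$ and $C_t = \sum_{j \le t} q_j v_j^\top$. The names `S`, `C`, `streaming`, `chunked`, and `quadratic_reference` are hypothetical, and the strictly causal masked variant with its two additional summaries is not modeled here. Under these assumptions each step costs $O(d^2 + d d_v)$, and because the state is combined by plain addition, a chunked evaluation yields the same outputs as the serial loop. Integer inputs keep the comparisons exact.

```python
def outer(a, b):
    """Outer product a b^T as a list of rows."""
    return [[x * y for y in b] for x in a]

def mat_add(A, B):
    """Elementwise sum of two equally shaped matrices (returns a new matrix)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def vec_mat(v, M):
    """Row vector times matrix."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def zeros(rows, cols):
    return [[0] * cols for _ in range(rows)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def quadratic_reference(Q, K, V):
    """Naive evaluation of the assumed double sum over the prefix."""
    outs = []
    for t in range(len(Q)):
        o = [0] * len(V[0])
        for i in range(t + 1):
            for j in range(t + 1):
                w = dot(Q[t], K[i]) * dot(K[i], Q[j])
                o = [a + w * b for a, b in zip(o, V[j])]
        outs.append(o)
    return outs

def step(S, C, q, k, v):
    """Update the (assumed) prefix statistics and emit one output."""
    S = mat_add(S, outer(k, k))        # d x d,   O(d^2)
    C = mat_add(C, outer(q, v))        # d x d_v, O(d d_v)
    return S, C, vec_mat(vec_mat(q, S), C)

def streaming(Q, K, V):
    """Serial recurrence with a constant-size state (S, C)."""
    S, C = zeros(len(Q[0]), len(Q[0])), zeros(len(Q[0]), len(V[0]))
    outs = []
    for q, k, v in zip(Q, K, V):
        S, C, o = step(S, C, q, k, v)
        outs.append(o)
    return outs

def chunked(Q, K, V, chunk=2):
    """Chunked evaluation: per-chunk totals, a prefix over chunks, local replay."""
    n, d, dv = len(Q), len(Q[0]), len(V[0])
    starts = list(range(0, n, chunk))
    # 1. Per-chunk totals (independent across chunks).
    totals = []
    for s in starts:
        S, C = zeros(d, d), zeros(d, dv)
        for t in range(s, min(s + chunk, n)):
            S, C, _ = step(S, C, Q[t], K[t], V[t])
        totals.append((S, C))
    # 2. Exclusive prefix over chunk totals; addition is associative.
    carry = [(zeros(d, d), zeros(d, dv))]
    for S, C in totals[:-1]:
        carry.append((mat_add(carry[-1][0], S), mat_add(carry[-1][1], C)))
    # 3. Replay each chunk from its incoming state (independent across chunks).
    outs = []
    for (S, C), s in zip(carry, starts):
        for t in range(s, min(s + chunk, n)):
            S, C, o = step(S, C, Q[t], K[t], V[t])
            outs.append(o)
    return outs

# Toy data: n = 5 tokens, d = 2, d_v = 3.
Q = [[1, 2], [0, 1], [3, -1], [2, 2], [-1, 0]]
K = [[1, 0], [2, 1], [0, 3], [-1, 1], [1, 1]]
V = [[1, 0, 2], [0, 1, 1], [2, 2, 0], [1, -1, 1], [0, 3, 1]]

assert streaming(Q, K, V) == quadratic_reference(Q, K, V)
assert chunked(Q, K, V, chunk=2) == streaming(Q, K, V)
```

The explicit loop in step 2 stands in for a parallel associative scan; any associative combine over chunk summaries could be evaluated in a tree instead. The plain-addition combine used here belongs to this simplified statistic only and should not be read as the operator the paper uses for its masked variant.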
