ZeroS: Zero-Sum Linear Attention for Efficient Transformers

Jiecheng Lu; Xu Han; Yan Sun; Viresh Pati; Yubin Kim; Siddhartha Somani; Shihao Yang

TL;DR

Zero-Sum Linear Attention (ZeroS) addresses limitations of linear attention by removing the constant zero-order term $1/t$ and reweighting the remaining zero-sum softmax residuals. The resulting weights are mathematically stable, can take both positive and negative values, and allow a single attention layer to perform contrastive operations.

Abstract

Linear attention methods offer Transformers $O(N)$ complexity but typically underperform standard softmax attention. We identify two fundamental limitations affecting these approaches: the restriction to convex combinations that only permits additive information blending, and uniform accumulated weight bias that dilutes attention in long contexts. We propose Zero-Sum Linear Attention (ZeroS), which addresses these limitations by removing the constant zero-order term $1/t$ and reweighting the remaining zero-sum softmax residuals. This modification creates mathematically stable weights, enabling both positive and negative values and allowing a single attention layer to perform contrastive operations. While maintaining $O(N)$ complexity, ZeroS theoretically expands the set of representable functions compared to convex combinations. Empirically, it matches or exceeds standard softmax attention across various sequence modeling benchmarks.
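The idea of removing the zero-order term can be illustrated with a minimal sketch. It assumes ordinary softmax weights over $t$ positions and subtracts the uniform term $1/t$ from each weight; the function names and the example logits are hypothetical, and the reweighting step that ZeroS applies to the residuals is not specified in this summary, so it is left out.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def zero_sum_residual(logits):
    """Softmax weights with the uniform zero-order term 1/t subtracted.

    Illustrative only: this is the residual before any reweighting,
    not the full ZeroS attention weight.
    """
    t = len(logits)
    return [p - 1.0 / t for p in softmax(logits)]

# Hypothetical attention scores for t = 4 positions.
logits = [2.0, 0.5, -1.0, 0.0]
weights = softmax(logits)              # non-negative, sums to 1
residual = zero_sum_residual(logits)   # mixed signs, sums to 0

print("softmax weights:", weights)
print("zero-sum residual:", residual)
print("residual sum:", sum(residual))
```

Because the softmax weights sum to one and exactly $t \cdot (1/t) = 1$ is subtracted in total, the residuals always sum to zero (up to floating-point error), and positions scored below the uniform level receive negative weight.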

Paper Structure (35 sections, 7 theorems, 45 equations, 5 figures, 8 tables)

Introduction
Background
Preliminaries: Attention Mechanisms
The intuition from existing linear attention research
Methodology
The Expansion of Softmax Function
Reweighted Zero-sum Softmax
ZeroS Linear Attention: Interaction Between Radial and Angular Components
Experiments
Language Modeling
Ablation Studies
Conclusion and Limitation
Technical Appendices and Supplementary Material
Additional Theoretical Discussion
Proof of Proposition 3.1 (Convex vs. Zero-Sum Span)
...and 20 more sections

Key Result

Proposition 3.1

Let $\{\bm v_i\}_{i=1}^t\subset \mathbb{R}^d$, and write $\mathcal{C} =\{\sum_i\alpha_i\bm v_i: \alpha_i\ge0,\;\sum_i\alpha_i=1\}, \ \mathcal{Z} =\{\sum_i w_i\bm v_i: \sum_i w_i=0\},$ where we denote the $(t-1)$-simplex by $\Delta_{t-1}=\{\alpha\in \mathbb{R}^t:\alpha_i\ge0,\sum_i\alpha_i=1\}.$ The
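The two sets defined in the proposition, $\mathcal{C}$ and $\mathcal{Z}$, can be compared on a toy example. The sketch below is only an illustration under assumed values: the three vectors, the weights, and the helper `combine` are hypothetical and are not taken from the paper.

```python
def combine(weights, vectors):
    """Return sum_i weights[i] * vectors[i] as a list of floats."""
    d = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(d)]

# Three hypothetical value vectors in R^2.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# A convex combination: alpha_i >= 0 and sum(alpha) = 1, so the result is in C.
alpha = [0.5, 0.3, 0.2]
convex_out = combine(alpha, vectors)

# A zero-sum combination: sum(w) = 0, so the result is in Z.
# Here it computes the contrast v_1 - v_2.
w = [1.0, -1.0, 0.0]
contrast_out = combine(w, vectors)

print("convex output:", convex_out)
print("zero-sum output:", contrast_out)
```

Every coordinate of a convex combination stays between the smallest and largest value of that coordinate among the inputs (here, within $[0, 1]$). The zero-sum combination produces a negative second coordinate, a difference of two inputs that no convex combination of these vectors can reach.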

Figures (5)

Figure 1: Illustration of the zero-sum linear attention block, including the computation of deviation logits and the reweighted zero-sum softmax operation
Figure 2: Evaluation of ZeroS on RegBench.
Figure 3: Performance evaluation on the MQAR benchmark, illustrating the relationship between model dimension (x-axis) and accuracy (y-axis). ZeroS demonstrates consistent performance advantages over other structures across all experimental configurations.
Figure 4: Performance Evaluation of ZeroS on OWT2
Figure 5: Block Architecture of The ZeroS-SM Layer

Theorems & Definitions (8)

Proposition 3.1: Convex vs. Zero-Sum Span
Corollary 3.2: Expressive Gain of Zero-Sum Attention
Proposition 3.3: Preservation of Affine Hull and Expressivity
Lemma 3.4: Numerical Stability of Zero-Sum Softmax
Proposition 3.5: Uniform Lipschitz Bound of Zero-Sum Softmax with decay factor $1/\sqrt{t}$
Proposition A.1: Convex vs. Zero‑sum Span
Corollary A.2: Expressive Capacity
Proof: Proof sketch of Proposition A.1 and Corollary A.2
