SOLAR: SVD-Optimized Lifelong Attention for Recommendation

Chenghao Zhang; Chao Feng; Yuanhao Pu; Xunyong Yang; Wenhui Yu; Xiang Li; Yongqi Liu; Lantao Hu; Kaiqiao Zhan; Han Li; Kun Gai

TL;DR

The paper introduces SVD-Attention, which is theoretically lossless on low-rank matrices and preserves softmax while reducing attention complexity from $O(N^2 d)$ to $O(Ndr)$, where $r$ is the rank of the SVD.

Abstract

The attention mechanism remains the defining operator in Transformers because it provides expressive global credit assignment, yet its $O(N^2 d)$ time and memory cost in sequence length $N$ makes long-context modeling expensive and often forces truncation or other heuristics. Linear attention reduces complexity to $O(N d^2)$ by reordering computation through kernel feature maps, but this reformulation drops the softmax mechanism and shifts the attention score distribution. In recommender systems, low-rank structure in matrices is not a rare case but the default inductive bias of representation learning, and it is particularly explicit in user behavior sequence modeling. Leveraging this structure, we introduce SVD-Attention, which is theoretically lossless on low-rank matrices and preserves softmax while reducing attention complexity from $O(N^2 d)$ to $O(Ndr)$. With SVD-Attention, we propose SOLAR, SVD-Optimized Lifelong Attention for Recommendation, a sequence modeling framework that supports behavior sequences at the ten-thousand scale and candidate sets of several thousand items in a cascading process without any filtering. In Kuaishou's online recommendation scenario, SOLAR delivers a 0.68% Video Views gain together with improvements in additional business metrics.
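The "lossless on low-rank matrices" claim rests on a standard linear-algebra fact: a matrix of rank $r$ is reproduced exactly (up to floating-point error) by its top-$r$ singular triplets. The sketch below illustrates only that fact with NumPy. It is not the paper's SVD-Attention; the sizes `n`, `d`, `r`, the synthetic matrix `x`, and both helper functions are hypothetical choices made for illustration.

```python
import numpy as np


def cumulative_singular_share(x):
    """Cumulative share of total singular-value mass, largest value first."""
    s = np.linalg.svd(x, compute_uv=False)
    return np.cumsum(s) / np.sum(s)


def truncated_svd_reconstruct(x, r):
    """Rebuild x from its top-r singular triplets only."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return (u[:, :r] * s[:r]) @ vt[:r, :]


# Hypothetical sizes: n sequence positions, d embedding dims, true rank r.
rng = np.random.default_rng(0)
n, d, r = 200, 32, 4

# A matrix that has rank r by construction (product of n x r and r x d factors).
x = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))

share = cumulative_singular_share(x)
x_r = truncated_svd_reconstruct(x, r)
max_err = float(np.max(np.abs(x - x_r)))

print("share of singular-value mass in the top r values:", share[r - 1])
print("max reconstruction error with r triplets:", max_err)
```

On real behavior-sequence embeddings the rank is only approximately low, so the same curve would show how many singular values are needed before the cumulative share saturates.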

Paper Structure (34 sections, 8 theorems, 81 equations, 4 figures, 4 tables, 1 algorithm)

Introduction
Contributions.
Related Work
Attention and Efficient Attention
Sequential Modeling in Recommendation Systems
Preliminaries
Methodology
SVD-Attention
Efficient SVD Forward Pass
Backward Pass Through SVD
Set-Wise Architectures
Ranking Bias: Point-wise Bottleneck
Generalization Bounds
Evaluation
Main results
...and 19 more sections

Key Result

Theorem 4.2

Let $p_{ij}$ denote the global marginal preference probability of item $i$ over $j$. The minimizer $f^\star \in \arg\min_{f \in \mathcal{F}_{\mathrm{point}}} R(f)$ satisfies $\sigma(f^\star(x_i) - f^\star(x_j)) = p_{ij}$.
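As a worked reading of the condition $\sigma(f^\star(x_i) - f^\star(x_j)) = p_{ij}$, assume $\sigma$ is the logistic sigmoid $\sigma(z) = 1/(1+e^{-z})$ (an assumption; this excerpt does not define $\sigma$) and that $0 < p_{ij} < 1$. Inverting the sigmoid then gives the score gap as a log-odds:

$$
f^\star(x_i) - f^\star(x_j) = \sigma^{-1}(p_{ij}) = \log\frac{p_{ij}}{1 - p_{ij}} .
$$

For example, $p_{ij} = 0.75$ gives a gap of $\log 3 \approx 1.0986$, and $p_{ij} = 0.5$ gives a gap of $0$. Under this reading, the optimal point-wise score difference depends only on the global marginal $p_{ij}$ and not on the rest of the candidate set.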

Figures (4)

Figure 1: Low-rank nature of user sequence representations. The figure shows the cumulative distribution of singular values from the SVD. At rank 27, all information is captured.
Figure 2: Overview of Attention Complexity. Softmax attention explicitly forms the dense attention matrix $QK^{\top}\in\mathbb{R}^{N\times N}$ and then multiplies by $V$, resulting in $O(N^{2}d)$ time complexity. Linear attention leverages associativity to rewrite the computation as $Q\,(K^{\top}V)$, avoiding the $N\times N$ matrix and reducing complexity to $O(Nd^{2})$. SVD-Attention applies a rank-$r$ SVD ($r\ll d$) to obtain a low-rank factorization. After the decomposed factors $U \Sigma V$ are reassembled, the computation involving $U$ can theoretically be eliminated for low-rank representations because $U^\top U = I$. This keeps the order of computation used in traditional attention and yields an overall complexity of $O(Ndr)$.
Figure 3: Illustration of SOLAR, an architecture that applies SVD-Attention to historical sequence modeling and includes set-wise modeling of the candidate set.
Figure 4: Forward latency of the attention module on CPU under single-thread execution. The x-axis varies the history length $N$ and the y-axis is the time in milliseconds. Candidate size $m$ and embedding dimension $d$ are held fixed.
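The first two orderings described in the Figure 2 caption can be sketched in pure Python. This is a minimal illustration, not the paper's code: it uses an identity feature map for the linear variant (an assumption; linear attention normally applies a kernel feature map), and all function names and the toy inputs are hypothetical. It shows that reordering the product is exact only once softmax is dropped, and it counts multiply-adds for the two orderings.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def softmax_attention(q, k, v):
    """softmax(Q K^T) V: materializes the N x N score matrix."""
    scores = matmul(q, transpose(k))
    weights = []
    for row in scores:
        m = max(row)  # subtract the row max for numerical stability
        e = [math.exp(s - m) for s in row]
        z = sum(e)
        weights.append([x / z for x in e])
    return matmul(weights, v)


def unnormalized_quadratic(q, k, v):
    """(Q K^T) V without softmax: still forms the N x N matrix."""
    return matmul(matmul(q, transpose(k)), v)


def unnormalized_linear(q, k, v):
    """Q (K^T V): same result by associativity, only d x d intermediates."""
    return matmul(q, matmul(transpose(k), v))


def multiply_adds(n, d):
    """Multiply-add counts for the two orderings with N = n tokens, width d."""
    return {"quadratic": 2 * n * n * d, "linear": 2 * n * d * d}


# Toy inputs with N = 3 positions and d = 2 dimensions.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 2.0], [0.0, 1.0], [1.0, 0.0]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

print(unnormalized_quadratic(q, k, v))
print(unnormalized_linear(q, k, v))
print(softmax_attention(q, k, v))
print(multiply_adds(3, 2))
```

The two unnormalized variants agree, while the softmax output differs from both. This is the trade-off the caption describes: the cheaper ordering is only available once the softmax between $QK^{\top}$ and $V$ is removed.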

Theorems & Definitions (18)

Definition 3.1: Scoring Functions
Definition 3.2: Bipartite Ranking Risk
Definition 3.3: Bayes-optimal Scorer
Definition 4.1: Contextual Flip
Theorem 4.2: Bayes Limit of Point-wise Scorers
Corollary 4.3: Irreducible Ranking Risk (proof in Appendix \ref{pf:context_flip_pairwise})
Theorem 4.4: I.I.D. (proof in Appendix \ref{pf:point})
Theorem 4.5: Block Dependent (proof in Appendix \ref{pf:block})
Corollary 4.6: Mismatch factor and extreme regimes
Lemma 4.7: Lipschitz Continuity (proof in Appendix \ref{pf:lipschitz})
...and 8 more
