SOLAR: SVD-Optimized Lifelong Attention for Recommendation

Chenghao Zhang, Chao Feng, Yuanhao Pu, Xunyong Yang, Wenhui Yu, Xiang Li, Yongqi Liu, Lantao Hu, Kaiqiao Zhan, Han Li, Kun Gai

TL;DR

SVD-Attention is introduced, which is theoretically lossless on low-rank matrices and preserves softmax while reducing attention complexity from $O(N^2 d)$ to $O(Ndr)$.

Abstract

The attention mechanism remains the defining operator in Transformers because it provides expressive global credit assignment, yet its $O(N^2 d)$ time and memory cost in sequence length $N$ makes long-context modeling expensive and often forces truncation or other heuristics. Linear attention reduces complexity to $O(N d^2)$ by reordering computation through kernel feature maps, but this reformulation drops the softmax mechanism and shifts the attention score distribution. In recommender systems, low-rank structure in matrices is not a rare case but the default inductive bias of representation learning, and it is particularly explicit in user behavior sequence modeling. Leveraging this structure, we introduce SVD-Attention, which is theoretically lossless on low-rank matrices and preserves softmax while reducing attention complexity from $O(N^2 d)$ to $O(Ndr)$. With SVD-Attention, we propose SOLAR (SVD-Optimized Lifelong Attention for Recommendation), a sequence modeling framework that supports behavior sequences at the ten-thousand scale and candidate sets of several thousand items in the cascading process without any filtering. In Kuaishou's online recommendation scenario, SOLAR delivers a 0.68% Video Views gain together with improvements in additional business metrics.

Paper Structure (34 sections, 8 theorems, 81 equations, 4 figures, 4 tables, 1 algorithm)

Key Result

Theorem 4.2

Let $p_{ij}$ denote the global marginal preference probability of item $i$ over $j$. The minimizer $f^\star \in \arg\min_{f \in \mathcal{F}_{\mathrm{point}}} R(f)$ satisfies $\sigma(f^\star(x_i) - f^\star(x_j)) = p_{ij}$.
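
As a worked illustration of the identity $\sigma(f^\star(x_i) - f^\star(x_j)) = p_{ij}$, assume that $\sigma$ is the logistic sigmoid (this excerpt does not state it) and take a hypothetical value $p_{ij} = 0.8$. Under those assumptions the optimal score gap is the log-odds of the preference probability:

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad f^\star(x_i) - f^\star(x_j) = \log\frac{p_{ij}}{1 - p_{ij}} = \log\frac{0.8}{0.2} = \log 4 \approx 1.386 .$$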

Figures (4)

  • Figure 1: Low rank nature of user sequence representation. This figure shows the cumulative distribution of eigenvalues from SVD decomposition. At rank 27, all information is captured.
  • Figure 2: Overview of Attention Complexity. Softmax attention explicitly forms the dense attention matrix $QK^{\top}\in\mathbb{R}^{N\times N}$ and then multiplies by $V$, resulting in $O(N^{2}d)$ time complexity. Linear attention leverages associativity to rewrite the computation as $Q\,(K^{\top}V)$, avoiding the $N\times N$ matrix and reducing complexity to $O(Nd^{2})$. SVD-Attention applies a rank-$r$ SVD ($r\ll d$) to obtain a low-rank factorization. Because $U^\top U = I$, the factor $U$ can in theory be eliminated once the decomposed $U \Sigma V$ is reassembled for low-rank representations. This keeps the order of computation of traditional attention and yields an overall complexity of $O(Ndr)$.
  • Figure 3: Illustration of SOLAR, an architecture that applies SVD-Attention to historical sequence modeling and includes set-wise modeling of the candidate set.
  • Figure 4: Forward latency of the attention module on CPU under single-thread execution. The x-axis varies the history length $N$ and the y-axis is the time in milliseconds. Candidate size $m$ and embedding dimension $d$ are held fixed.
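
The computation orders described in the Figure 2 caption can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the paper's SVD-Attention algorithm: the function names, the toy sizes, and the choice to factor the key matrix $K$ with `np.linalg.svd` are all hypothetical. It only checks two facts: the softmax-free product can be reordered by associativity, and the score matrix $QK^{\top}$ is recovered exactly from a rank-$r$ factorization when $K$ has rank at most $r$, so softmax is preserved.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: forms the dense (m x N) score matrix, O(m N d)."""
    scores = Q @ K.T                                      # (m, N)
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V                                    # (m, d)

def linear_attention_unnormalized(Q, K, V):
    """Softmax-free reordering Q (K^T V): builds a (d x d) matrix, O(N d^2)."""
    return Q @ (K.T @ V)

def low_rank_scores(Q, K, r):
    """Scores Q K^T through a rank-r truncated SVD of K (exact if rank(K) <= r).

    Note: the full SVD call here is for clarity only and is itself not O(N d r).
    """
    U, s, Vt = np.linalg.svd(K, full_matrices=False)      # K = U diag(s) Vt
    U_r, s_r, Vt_r = U[:, :r], s[:r], Vt[:r, :]
    Q_proj = (Q @ Vt_r.T) * s_r                           # (m, r), O(m d r)
    return Q_proj @ U_r.T                                 # (m, N), O(m N r)

def softmax_attention_low_rank(Q, K, V, r):
    """Softmax attention whose scores come from the rank-r factorization."""
    scores = low_rank_scores(Q, K, r)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

# Toy sizes chosen for illustration only; they are not taken from the paper.
rng = np.random.default_rng(0)
N, d, r, m = 64, 16, 4, 8
K = rng.normal(size=(N, r)) @ rng.normal(size=(r, d))     # exactly rank r
V = rng.normal(size=(N, d))
Q = rng.normal(size=(m, d))

same_by_associativity = np.allclose((Q @ K.T) @ V,
                                    linear_attention_unnormalized(Q, K, V))
same_scores = np.allclose(Q @ K.T, low_rank_scores(Q, K, r))
same_attention = np.allclose(softmax_attention(Q, K, V),
                             softmax_attention_low_rank(Q, K, V, r))
```

All three flags come out true on this exactly rank-$r$ input. With a key matrix that is only approximately low-rank, the truncated scores would be an approximation instead of an exact match.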

Theorems & Definitions (18)

  • Definition 3.1: Scoring Functions
  • Definition 3.2: Bipartite Ranking Risk
  • Definition 3.3: Bayes-optimal Scorer
  • Definition 4.1: Contextual Flip
  • Theorem 4.2: Bayes Limit of Point-wise Scorers
  • Corollary 4.3: Irreducible Ranking Risk (proof in Appendix)
  • Theorem 4.4: I.I.D. (proof in Appendix)
  • Theorem 4.5: Block Dependent (proof in Appendix)
  • Corollary 4.6: Mismatch factor and extreme regimes
  • Lemma 4.7: Lipschitz Continuity (proof in Appendix)
  • ...and 8 more