Cross-attention Secretly Performs Orthogonal Alignment in Recommendation Models

Hyunin Lee, Yong Zhang, Hoang Vu Nguyen, Xiaoyi Liu, Namyong Park, Christopher Jung, Rong Jin, Yang Wang, Zhigang Wang, Somayeh Sojoudi, Xue Feng

TL;DR

This work reframes cross-attention in cross-domain sequential recommendation by uncovering Orthogonal Alignment, a mechanism where cross-attention yields complementary information not present in the query input. It introduces the Gated Cross Attention (GCA) module and demonstrates that the cross-attention output X' often becomes orthogonal to the query input X as model performance improves, without explicit orthogonality constraints. Through extensive experiments on three CDSR baselines (CDSRNP, ABXI, LLM4CDSR) across multiple dataset splits, it shows that early-stage GCA improves ranking metrics (NDCG, AUC), that orthogonalization correlates with performance gains, and that GCA provides a parameter-efficient path to scaling. The findings suggest a paradigm shift toward orthogonality-aware fusion in multi-modal, multi-domain settings, with implications for scalable deployment and novel evaluation metrics.

Abstract

Cross-domain sequential recommendation (CDSR) aims to align heterogeneous user behavior sequences collected from different domains. While cross-attention is widely used to enhance alignment and improve recommendation performance, its underlying mechanism is not fully understood. Most researchers interpret cross-attention as residual alignment, in which the output is generated by removing redundant information from the query input and preserving non-redundant information, with reference to data from another domain that serves as the key and value inputs. Beyond this prevailing view, we introduce Orthogonal Alignment, a phenomenon in which cross-attention discovers novel information that is not present in the query input, and we further argue that these two contrasting alignment mechanisms can co-exist in recommendation models. Across more than 300 experiments, we find that model performance improves when the query input and the output of cross-attention are orthogonal. Notably, Orthogonal Alignment emerges naturally, without any explicit orthogonality constraints. Our key insight is that it emerges because it improves the scaling law. We show that baselines that additionally incorporate a cross-attention module outperform parameter-matched baselines, achieving superior accuracy per model parameter. We hope these findings offer new directions for parameter-efficient scaling in multi-modal research.
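
To make the notion of orthogonality between the query input and the cross-attention output concrete, the following is a standard linear-algebra sketch, not an equation taken from the paper. The symbols $x$, $x'$, $x'_{\parallel}$ and $x'_{\perp}$ are introduced here for illustration: $x$ is one query-token vector and $x'$ is the corresponding cross-attention output.

```latex
% Cosine similarity between a query vector x and its cross-attention output x'
\cos(x, x') = \frac{\langle x, x' \rangle}{\|x\|_2 \, \|x'\|_2}

% Decomposition of x' into a component along x and a component orthogonal to x
x' = \underbrace{\frac{\langle x, x' \rangle}{\|x\|_2^2}\, x}_{x'_{\parallel}}
   + \underbrace{\left( x' - \frac{\langle x, x' \rangle}{\|x\|_2^2}\, x \right)}_{x'_{\perp}},
\qquad \langle x, x'_{\perp} \rangle = 0

% The two vectors are orthogonal exactly when the parallel component vanishes
\cos(x, x') = 0 \iff x'_{\parallel} = 0 \iff x' = x'_{\perp}
```

Under this reading, a cosine similarity near zero means the output lies almost entirely in directions the query input does not span, which is one way to interpret "novel information that is not present in the query input".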

Paper Structure

This paper contains 17 sections, 1 equation, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Conceptual illustration of Orthogonal Alignment. Given a source representation vector $Y$ from domain $B$, suppose the algorithm progressively updates the target representation vector $X$ from domain $A$ over training iterations $\{X_1, X_2, \cdots, X'\}$. (a) Residual alignment: The prevailing view of cross-attention is that it refines $X$ into $X'$ by referring to $Y$ to reduce irrelevant information and preserve relevant information. (b) Orthogonal Alignment: We observe a complement-discovery phenomenon in which $X'$ becomes increasingly orthogonal to $X$ as model performance improves (see the subsection on Observation 2). We show that this orthogonality emerges because cross-attention enables parameter-efficient scaling by extracting complementary information from an orthogonal manifold $T(X)$, thus enhancing performance without a proportional increase in parameters (see the subsection on Observation 3). (c) $X'$ is the output of cross-attention, with $X$ as the query and $Y$ as the key and value.
  • Figure 2: In cross-domain sequential recommendation, various fusion structures ((a) to (d)) across heterogeneous sequence data form the central backbone of the model. In this work, we propose a gated cross-attention mechanism applied at the early interaction stage between two domain sequences. Empirical results show that this module consistently improves recommendation performance (see the subsection on Observation 1). Our main analysis reveals that a primary role of gated cross-attention is to induce orthogonal representations of the query inputs (see the subsection on Observation 2). Specifically, we observe that a reduction in cosine similarity between $X_A$ and its cross-attended counterpart $X_A'$ correlates strongly with enhanced recommendation accuracy.
  • Figure 3: For each baseline model, we insert $\texttt{GCA}$ modules at multiple vertical positions, denoted as $\texttt{GCA}[i]$, where $i=0$ corresponds to the module closest to the raw data and $i=N$ to the module farthest from the raw data. By design, $\texttt{GCA}[0]$ is always placed immediately after the embedding layer, while $\texttt{GCA}[1], \texttt{GCA}[2], \ldots$ are positioned within intermediate layers of the backbone. Each $\texttt{GCA}[i]$ comprises two parallel gated cross-attention modules, which respectively refine the representations of domains $A$ and $B$.
  • Figure 4: NDCG@$\{1,10\}$–AUC correlations differ by backbone: ABXI exhibits a consistent negative correlation across domains, while LLM4CDSR shows a consistent positive correlation.
  • Figure 5: Effect of vertically stacking $\texttt{GCA}$ modules. Increasing the number of insertions (i.e., inserting more $\texttt{GCA}[i]$ modules with $i>1$) does not yield monotonic gains: LLM4CDSR peaks at [1], CDSRNP saturates beyond [0,1], and ABXI achieves its best median NDCG at [1,2] but suffers negative transfer with other placements.
  • ...and 3 more figures
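
The captions above describe cross-attention with $X$ as the query and $Y$ as the key and value, and a cosine-similarity measurement between $X_A$ and its cross-attended counterpart $X_A'$. The NumPy sketch below illustrates that measurement under stated assumptions. The single-head formulation, the sigmoid gate in `gated_cross_attention`, the random weights, and every function and variable name are hypothetical choices made for illustration; they are not the paper's GCA implementation.

```python
import numpy as np


def softmax(z, axis=-1):
    # Subtract the maximum along the axis for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(x, y, w_q, w_k, w_v):
    """Single-head cross-attention: x is the query, y supplies keys and values."""
    q, k, v = x @ w_q, y @ w_k, y @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # shape (len_x, len_y)
    return softmax(scores, axis=-1) @ v      # shape (len_x, d)


def gated_cross_attention(x, y, w_q, w_k, w_v, w_g):
    """Assumed gate form: sigmoid(x @ w_g) scales the attended output elementwise."""
    gate = 1.0 / (1.0 + np.exp(-(x @ w_g)))
    return gate * cross_attention(x, y, w_q, w_k, w_v)


def mean_cosine_similarity(a, b, eps=1e-12):
    """Mean over sequence positions of the cosine similarity between a[t] and b[t]."""
    num = (a * b).sum(axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    return float((num / den).mean())


# Toy data: random sequences stand in for the two domain representations.
rng = np.random.default_rng(0)
d, len_a, len_b = 16, 5, 7
x_a = rng.normal(size=(len_a, d))  # domain A sequence (query)
y_b = rng.normal(size=(len_b, d))  # domain B sequence (key and value)
w_q, w_k, w_v, w_g = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(4))

x_a_prime = gated_cross_attention(x_a, y_b, w_q, w_k, w_v, w_g)
similarity = mean_cosine_similarity(x_a, x_a_prime)
print(x_a_prime.shape, similarity)
```

In a trained model, one would log `similarity` alongside a ranking metric such as NDCG to check whether lower cosine similarity accompanies better accuracy. With the untrained random weights used here, the printed value carries no such meaning.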