Cross-attention Secretly Performs Orthogonal Alignment in Recommendation Models
Hyunin Lee, Yong Zhang, Hoang Vu Nguyen, Xiaoyi Liu, Namyong Park, Christopher Jung, Rong Jin, Yang Wang, Zhigang Wang, Somayeh Sojoudi, Xue Feng
TL;DR
This work reframes cross-attention in cross-domain sequential recommendation by uncovering Orthogonal Alignment, a mechanism in which cross-attention yields complementary information not present in the query input. It introduces the Gated Cross Attention (GCA) module and demonstrates that the cross-attention output X' often becomes orthogonal to the query input X as model performance improves, without explicit orthogonality constraints. Through extensive experiments on three CDSR baselines (CDSRNP, ABXI, LLM4CDSR) across multiple dataset splits, it shows that early-stage GCA improves ranking metrics (NDCG, AUC), that orthogonalization correlates with performance gains, and that GCA provides a parameter-efficient path to scaling. The findings suggest a paradigm shift toward orthogonality-aware fusion in multi-modal, multi-domain settings, with implications for scalable deployment and novel evaluation metrics.
Abstract
Cross-domain sequential recommendation (CDSR) aims to align heterogeneous user behavior sequences collected from different domains. While cross-attention is widely used to enhance alignment and improve recommendation performance, its underlying mechanism is not fully understood. Most researchers interpret cross-attention as residual alignment, where the output is generated by removing redundant information from the query input and preserving non-redundant information, using data from another domain as the key and value inputs. Beyond this prevailing view, we introduce Orthogonal Alignment, a phenomenon in which cross-attention discovers novel information that is not present in the query input, and we further argue that these two contrasting alignment mechanisms can co-exist in recommendation models. In over 300 experiments, we find that model performance improves when the query input and output of cross-attention are orthogonal. Notably, Orthogonal Alignment emerges naturally, without any explicit orthogonality constraints. Our key insight is that Orthogonal Alignment emerges naturally because it improves the scaling law. We show that baselines that additionally incorporate a cross-attention module outperform parameter-matched baselines, achieving superior accuracy per model parameter. We hope these findings offer new directions for parameter-efficient scaling in multi-modal research.
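To make the notion of orthogonality between the query input and the cross-attention output concrete, the following is a minimal sketch in plain Python. It is not the paper's GCA module or its exact measurement. It assumes a single-head scaled dot-product cross-attention with no learned projections or gating, and it assumes that orthogonality is scored by the mean absolute cosine similarity between each query vector and its output vector. The function names (`cross_attention`, `mean_abs_cosine`) and the toy vectors are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is zero."""
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot(u, v) / (nu * nv)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (assumption: no learned
    projections and no gating, unlike a real recommendation model)."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in keys])
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values))
             for j in range(len(values[0]))]
        )
    return outputs


def mean_abs_cosine(X, X_prime):
    """Hypothetical orthogonality score: values near 0 mean X' is close to
    orthogonal to X, values near 1 mean X' is aligned with X."""
    return sum(abs(cosine_similarity(x, xp)) for x, xp in zip(X, X_prime)) / len(X)


# Toy data: query vectors from one domain, key/value vectors from another.
X = [[1.0, 0.0], [2.0, 0.0]]          # query input (domain A)
other = [[0.0, 1.0], [0.0, 2.0]]      # key and value input (domain B)

X_prime = cross_attention(X, other, other)
score = mean_abs_cosine(X, X_prime)
```

In this toy case the domain B vectors lie entirely along the second axis, so every output row is a weighted average of them and has no component along the query direction, which gives a score of zero. In a trained model the score would depend on the learned projections, and the abstract's claim is that it tends toward zero without being constrained to do so.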
