Your Causal Self-Attentive Recommender Hosts a Lonely Neighborhood

Yueqi Wang, Zhankui He, Zhenrui Yue, Julian McAuley, Dong Wang

TL;DR

This work addresses the unresolved performance comparison between auto-encoding (AE) and auto-regressive (AR) self-attention in sequential recommendation. It introduces two theoretically grounded metrics (sparsity of the attention matrix and its rank-$k$ approximation) and a modular experimental framework (ModSAR) to study AE vs. AR across vanilla, variant, and HuggingFace models. The findings show that AR attention exhibits a sparse local neighborhood bias and stores richer data dynamics, which requires higher-rank representations. Empirically, AR outperforms AE overall across five diverse datasets and design spaces, including NLP-model integrations. The paper argues for adopting AR as the more robust starting point for future self-attentive recommender designs and provides open-source tooling to accelerate research and design-space exploration.

Abstract

In the context of sequential recommendation, a pivotal issue pertains to the comparative analysis between bi-directional/auto-encoding (AE) and uni-directional/auto-regressive (AR) attention mechanisms, where the conclusions regarding architectural and performance superiority remain inconclusive. Previous efforts in such comparisons primarily involve summarizing existing works to identify a consensus or conducting ablation studies on peripheral modeling techniques, such as choices of loss functions. However, far fewer efforts have been made in (1) theoretical and (2) extensive empirical analysis of the self-attention module, the very pivotal structure on which performance and design insights should be anchored. In this work, we first provide a comprehensive theoretical analysis of the AE/AR attention matrix in terms of (1) sparse local inductive bias, a.k.a. neighborhood effects, and (2) low-rank approximation. Analytical metrics reveal that AR attention exhibits sparse neighborhood effects suitable for generally sparse recommendation scenarios. Secondly, to support our theoretical analysis, we conduct extensive empirical experiments comparing AE/AR attention on five popular benchmarks, with AR performing better overall. Empirical results reported are based on our experimental pipeline named Modularized Design Space for Self-Attentive Recommender (ModSAR), supporting adaptive hyperparameter tuning, a modularized design space, and HuggingFace plug-ins. We invite the recommendation community to utilize/contribute to ModSAR to (1) conduct more module/model-level examination beyond the AE/AR comparison and (2) accelerate state-of-the-art model design. Lastly, we shed light on future design choices for performant self-attentive recommenders. We make our pipeline implementation and data available at https://github.com/yueqirex/SAR-Check.
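
To make the AE/AR distinction concrete, the following is a minimal sketch of bi-directional versus causal softmax attention on a toy score matrix. It is not taken from ModSAR: the function names, the `scores` values, and the `zero_fraction` proxy are all hypothetical, and the proxy (share of exactly-zero weights) is only an illustrative stand-in, not the sparsity metric defined in the paper.

```python
import math

def softmax_attention(scores, causal):
    """Row-wise softmax over raw scores.

    With causal=True (AR), position i may only attend to positions j <= i;
    with causal=False (AE), every position attends to the full sequence.
    """
    n = len(scores)
    attn = []
    for i in range(n):
        allowed = [j for j in range(n) if (j <= i or not causal)]
        row_max = max(scores[i][j] for j in allowed)  # for numerical stability
        exps = {j: math.exp(scores[i][j] - row_max) for j in allowed}
        z = sum(exps.values())
        attn.append([exps[j] / z if j in exps else 0.0 for j in range(n)])
    return attn

def zero_fraction(attn):
    """Illustrative sparsity proxy (an assumption): share of exactly-zero weights."""
    n = len(attn)
    return sum(1 for row in attn for w in row if w == 0.0) / (n * n)

# Hypothetical raw attention scores for a 4-item interaction sequence.
scores = [
    [0.2, 1.0, -0.5, 0.3],
    [0.1, 0.4, 0.9, -1.2],
    [1.5, -0.3, 0.0, 0.7],
    [0.6, 0.6, -0.8, 0.2],
]

ae = softmax_attention(scores, causal=False)  # bi-directional (AE)
ar = softmax_attention(scores, causal=True)   # uni-directional (AR)

print("AE zero fraction:", zero_fraction(ae))
print("AR zero fraction:", zero_fraction(ar))
```

In this toy setting the causal mask alone zeroes out the strict upper triangle, so the AR matrix concentrates each row's weight on a shorter prefix of the sequence; the paper's analysis of neighborhood effects goes beyond this masking effect.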

Paper Structure (24 sections, 12 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Attention visualizations. The local effect grows from left to right.
  • Figure 2: Attention visualizations. The first row shows the original matrices; the second row shows the corresponding low-rank approximations using the top-5 largest singular values. Vanilla-AE (BERT4Rec-like here) attention has a clear pattern in its rank-5 approximation, while local-attn and AR lose their patterns and need a higher-rank approximation.
  • Figure 3: Singular value distribution for a random user X, in descending order, for Vanilla-{AE, AR} (BERT4Rec-like here). AE's singular values drop to near zero at a faster rate than AR's, suggesting a lower-rank approximation.
  • Figure 4: The overall architecture of ModSAR. The right part shows the self-attentive backbone, which is controlled by the modularized design space of {Feature, Modeling, Loss and Task} on the left. Users can also choose between {self-attention + design space, HuggingFace models}. Ray Tune and ASHA adaptive search are used to manage experiments.
  • Figure 5: \ref{fig:bar_locality} shows that average performance increases with growing local inductive bias, consistent with our theoretical analysis. Local-attn in \ref{fig:bar_SAR} shows the lowest improvement because Local-attn-AE already has a local inductive bias injected.
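
The rank-5 approximations described for Figures 2 and 3 can be sketched with a truncated SVD. The example below is only an illustration under stated assumptions: it uses NumPy, two synthetic matrices (a uniform full-attention matrix and a uniform causal matrix) rather than learned attention from the paper, and hypothetical helper names `rank_k_approx` and `relative_error`.

```python
import numpy as np

def rank_k_approx(a, k):
    """Best rank-k approximation of `a` in Frobenius norm via truncated SVD."""
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    return (u[:, :k] * s[:k]) @ vt[:k, :]

def relative_error(a, k):
    """Frobenius-norm error of the rank-k approximation, relative to ||a||."""
    return np.linalg.norm(a - rank_k_approx(a, k)) / np.linalg.norm(a)

n, k = 20, 5

# Synthetic stand-in for a bi-directional matrix: uniform attention (rank 1).
uniform_ae = np.full((n, n), 1.0 / n)

# Synthetic stand-in for a causal matrix: row i attends uniformly to j <= i.
# It is lower-triangular with a nonzero diagonal, hence full rank.
causal_ar = np.tril(np.ones((n, n))) / np.arange(1, n + 1)[:, None]

print("rank-5 relative error, uniform full attention:", relative_error(uniform_ae, k))
print("rank-5 relative error, uniform causal attention:", relative_error(causal_ar, k))
print("numerical rank of causal matrix:", np.linalg.matrix_rank(causal_ar))
```

The synthetic causal matrix cannot be captured exactly by five singular values, whereas the uniform matrix can; this mirrors, in a toy setting only, the caption's observation that AR attention needs a higher-rank approximation than Vanilla-AE.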