An Attentive Inductive Bias for Sequential Recommendation beyond the Self-Attention

Yehjin Shin; Jeongwhan Choi; Hyowon Wi; Noseong Park

TL;DR

This work presents Beyond Self-Attention for Sequential Recommendation (BSARec), a method that leverages the Fourier transform to inject an inductive bias that captures fine-grained sequential patterns and to integrate low- and high-frequency information to mitigate oversmoothing.

Abstract

Sequential recommendation (SR) models based on Transformers have achieved remarkable success. The self-attention mechanism of Transformers for computer vision and natural language processing suffers from the oversmoothing problem, i.e., the hidden representations of tokens become similar to one another. We show, for the first time, that the same problem occurs in the SR domain. Our investigation reveals the low-pass filtering nature of self-attention in SR, which causes oversmoothing. To address this, we propose a novel method called $\textbf{B}$eyond $\textbf{S}$elf-$\textbf{A}$ttention for Sequential $\textbf{Rec}$ommendation (BSARec), which leverages the Fourier transform to i) inject an inductive bias by considering fine-grained sequential patterns and ii) integrate low- and high-frequency information to mitigate oversmoothing. Our discovery shows significant advancements in the SR domain and is expected to bridge the gap for existing Transformer-based SR models. We test our proposed approach through extensive experiments on 6 benchmark datasets. The experimental results demonstrate that our model outperforms 7 baseline methods in terms of recommendation performance. Our code is available at https://github.com/yehjin-shin/BSARec.
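To make the idea of integrating low- and high-frequency information concrete, the following is a minimal sketch, not the authors' implementation (see their repository for that). It assumes that a one-dimensional sequence is split with a discrete Fourier transform into a low-frequency part (the `c` lowest frequencies) and a high-frequency remainder, and that the remainder is rescaled by a factor `beta` before the two are added back together. The function names `split_frequencies` and `rescale`, and the parameters `c` and `beta`, are hypothetical names chosen for this illustration.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def split_frequencies(x, c):
    """Split x into low- and high-frequency components (assumed scheme).

    Frequencies k with min(k, n - k) < c count as "low"; keeping both k and
    n - k preserves conjugate symmetry, so the low-pass output stays real.
    """
    n = len(x)
    spectrum = dft(x)
    low = [spectrum[k] if min(k, n - k) < c else 0 for k in range(n)]
    lfc = [v.real for v in idft(low)]
    hfc = [a - b for a, b in zip(x, lfc)]  # everything the low-pass removed
    return lfc, hfc


def rescale(x, c, beta):
    """Recombine the components as LFC + beta * HFC (assumed form)."""
    lfc, hfc = split_frequencies(x, c)
    return [low + beta * high for low, high in zip(lfc, hfc)]


# A steady signal with one abrupt change, loosely mimicking a stable
# long-term interest interrupted by a short-term one.
history = [1.0, 1.0, 1.0, 5.0, 1.0, 1.0, 1.0, 1.0]
smoothed = rescale(history, c=1, beta=0.0)   # low frequencies only
amplified = rescale(history, c=1, beta=2.0)  # abrupt change emphasized
print(smoothed)
print(amplified)
```

With `beta = 1` the input is recovered unchanged, with `beta = 0` only the smooth trend survives, and `beta > 1` emphasizes abrupt changes. In this sketch that single factor is what controls how much high-frequency information is retained.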

Paper Structure (44 sections, 1 theorem, 18 equations, 9 figures, 9 tables)

Introduction
Preliminaries
Problem Formulation
Self-Attention for Sequential Recommendation
Discrete vs. Graph Fourier Transform
Motivation
Proposed Method
Embedding Layer
Beyond Self-Attention Encoder
Beyond Self-Attention Layer
Attentive Inductive Bias with Frequency Rescaler
Meaning of our Attentive Inductive Bias
Point-wise Feed-Forward Network and Layer Outputs
Prediction Layer and Training
Relation to Previous Models
...and 29 more sections

Key Result

Theorem 1

Let $\mathbf{A} = \textrm{softmax}(\mathbf{Q}\mathbf{K}^{\mathtt{T}}/\sqrt{d})$. Then $\mathbf{A}$ inherently acts as a low-pass filter. In other words, for all $\bm{x}\in\mathbb{R}^N$, $\lim_{t\rightarrow \infty} \|\text{HFC}[\mathbf{A}^t(\bm{x})]\|_2 / \|\text{LFC}[\mathbf{A}^t(\bm{x})]\|_2=0$.
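The limit in Theorem 1 can be checked numerically on a small example. The sketch below is an illustration under stated assumptions, not the paper's proof or code: it reads HFC and LFC as the high- and low-frequency components of a signal, and it uses the simplest possible split, in which LFC is only the DC (mean) component and HFC is everything else. The matrices `Q` and `K` are random placeholders, and the helper names are hypothetical. The input `x` is drawn from $[1, 2]$ so that its mean stays away from zero and the ratio is well defined.

```python
import math
import random


def softmax_rows(scores):
    """Row-wise softmax; every row of the result is positive and sums to 1."""
    result = []
    for row in scores:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


def hfc_lfc_ratio(x):
    """||HFC[x]||_2 / ||LFC[x]||_2, assuming LFC is the DC (mean) component only."""
    mean = sum(x) / len(x)
    lfc_norm = math.sqrt(len(x)) * abs(mean)               # norm of mean * ones
    hfc_norm = math.sqrt(sum((v - mean) ** 2 for v in x))  # norm of the remainder
    return hfc_norm / lfc_norm


random.seed(0)
N, d = 8, 4  # sequence length and head dimension (illustrative values)
Q = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(N)]
K = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(N)]
scores = [
    [sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d) for j in range(N)]
    for i in range(N)
]
A = softmax_rows(scores)

x = [random.uniform(1, 2) for _ in range(N)]
ratios = []
for _ in range(1000):
    x = matvec(A, x)  # after t iterations, x holds A^t applied to the input
    ratios.append(hfc_lfc_ratio(x))

print(ratios[0], ratios[9], ratios[-1])
```

Because every row of `A` is a positive weighting that sums to 1, each application averages the entries of `x`, so the spread around the mean shrinks while the mean itself stays bounded away from zero. This is the sense in which repeated self-attention behaves like a low-pass filter in this toy setting.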

Figures (9)

Figure 1: Illustration of high and low-frequency signals in SR. A user $u_1$'s long-term persisting interests and tastes constitute low frequencies in the Fourier domain of embedding, and abrupt short-term changes in $u_1$'s interests correspond to high frequencies.
Figure 2: (a) A ring graph with $N$ nodes, and (b) visualization of the filter of the self-attentions in LastFM.
Figure 3: Visualization of oversmoothing in LastFM. The singular values and cosine similarity of user sequence output embedding.
Figure 4: Architecture of our proposed BSARec. We propose a BSA encoder that uses both an inductive bias with a frequency rescaler and original self-attention.
Figure 5: Sensitivity to $\alpha$. More results on other datasets are in the Appendix.
...and 4 more figures

Theorems & Definitions (3)

Theorem 1: Self-Attention is a low-pass filter
Definition 2
proof
