Gated Rotary-Enhanced Linear Attention for Long-term Sequential Recommendation

Juntao Hu, Wei Zhou, Huayi Shen, Xiao Du, Jie Liao, Min Gao, Jun Zeng, Junhao Wen

TL;DR

This work addresses the inefficiency of modeling long-term user sequences in sequential recommendation by combining rotary position encoding (RoPE) with linear attention. It introduces RELA, which applies RoPE within a linear-attention framework, and GRELA, which adds a local shortcut and a SiLU-based gating mechanism to distinguish short-term bursts of interest from genuine long-term shifts. Empirical results on four public benchmarks show RecGRELA achieving state-of-the-art performance with low memory overhead, and ablation analyses indicate that gating, RoPE, and local modeling each contribute. The approach offers a scalable alternative to Transformer-, RNN-, and Mamba-based SRS models, with potential for extension to session-based and multi-modal settings.

Abstract

In Sequential Recommendation Systems (SRSs), Transformer models have demonstrated remarkable performance but face computational and memory cost challenges, especially when modeling long-term user behavior sequences. Because of its quadratic complexity, the dot-product attention mechanism in Transformers becomes expensive for long sequences. By approximating dot-product attention with carefully designed mapping functions, linear attention provides a more efficient option with linear complexity. However, existing linear attention methods face three limitations: 1) they often use learnable position encodings, which incur extra computational costs in long-term sequence scenarios, 2) they may not sufficiently account for users' fine-grained local preferences (short-lived bursts of interest), and 3) they try to capture temporary activities but often confuse these with stable, long-term interests, which can result in unclear or less effective recommendations. To remedy these drawbacks, we propose a long-term sequential Recommendation model with Gated Rotary-Enhanced Linear Attention (RecGRELA). Specifically, we first propose a Rotary-Enhanced Linear Attention (RELA) module to efficiently model long-range dependencies within the user's historical information using rotary position encodings. Then, we introduce a local shortcut operation to incorporate the local preferences of interactions and provide theoretical insight into it. We further introduce a SiLU-based Gated mechanism for RELA (GRELA) to help the model tell whether a user behavior reflects a short-term, local interest or a real change in long-term tastes. Experimental results on four public benchmark datasets show that RecGRELA achieves state-of-the-art performance compared with existing SRSs based on Recurrent Neural Networks, Transformer, and Mamba while keeping memory overhead low.
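The complexity argument in the abstract can be made concrete with a small NumPy sketch. It is not the RecGRELA implementation: the kernel $\phi(x) = \mathrm{ELU}(x) + 1$ and the function names are assumptions chosen for illustration, and position encoding, the local shortcut, and gating are left out. It only shows that causal kernelized attention can be computed with running sums instead of an $n \times n$ score matrix.

```python
import numpy as np

def feature_map(x):
    # Assumed kernel phi(x) = ELU(x) + 1, which keeps every feature positive.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def causal_quadratic(q, k, v):
    # Reference form: materializes the full n x n score matrix, O(n^2 * d).
    scores = np.tril(feature_map(q) @ feature_map(k).T)
    return (scores @ v) / scores.sum(axis=1, keepdims=True)

def causal_linear(q, k, v):
    # Same output from running sums, O(n * d^2), with no n x n matrix.
    q, k = feature_map(q), feature_map(k)
    kv = np.zeros((k.shape[1], v.shape[1]))  # running sum of outer(phi(k_j), v_j)
    z = np.zeros(k.shape[1])                 # running sum of phi(k_j)
    out = np.empty_like(v)
    for i in range(q.shape[0]):
        kv += np.outer(k[i], v[i])
        z += k[i]
        out[i] = (q[i] @ kv) / (q[i] @ z)
    return out

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 16, 8))  # sequence length 16, width 8
print(np.allclose(causal_quadratic(q, k, v), causal_linear(q, k, v)))
```

The state carried between steps (`kv` and `z`) has a size that depends only on the embedding width, not on the sequence length, which is where the memory saving for long user histories comes from.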

Paper Structure

This paper contains 34 sections, 25 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Illustration of the differences among SASRec, LinRec, and our RecGRELA. $\phi$ denotes the kernel function.
  • Figure 2: The overview of our proposed architecture. The Overall Architecture (a) processes an input sequence through an Embedding Layer, whose output is then fed into a stack of $L$ identical encoder layers. Each encoder layer consists of Layer Normalization, a GRELA Block, another Layer Normalization, and an MLP, finally leading to a Prediction Layer. The central components are the GRELA Block (b), which utilizes RoPE (Rotary Position Encoding, d), and the RELA module (c). Within the RELA module, query ($Q$) and key ($K$) are processed by an ELU activation and RoPE. Attention scores are computed via matrix multiplication of the RoPE-enhanced $Q$ and $K$ (after $K$ is scaled), followed by another scaling and a concatenation step. A Causal Conv1D and a SiLU activation process the value ($V$) projection to capture local context.
  • Figure 3: RecGRELA vs. baselines in terms of the GPU memory and FLOPs of the training stage on the ML-1M dataset.
  • Figure 4: The effectiveness of each variant of RecGRELA on ML-1M.
  • Figure 5: The performance of different position encodings and activation functions.
  • ...and 4 more figures
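
Two components named in the Figure 2 caption, RoPE on $Q$/$K$ and the Causal Conv1D plus SiLU path on $V$, can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the paper's code: the half-split pairing of dimensions, the `base` of 10000, the depthwise form of the convolution, and all function names are hypothetical choices. How the gate and the scaling steps combine these pieces is not described in this excerpt and is left out.

```python
import numpy as np

def rope(x, base=10000.0):
    # Rotary position encoding with no learnable parameters.
    # x: (n, d) with even d; pairs (x[:, i], x[:, i + d/2]) are rotated
    # by an angle proportional to the position index (assumed convention).
    n, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)
    angles = np.arange(n)[:, None] * freqs[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)

def silu(x):
    # SiLU(x) = x * sigmoid(x).
    return x / (1.0 + np.exp(-x))

def causal_depthwise_conv(x, w):
    # x: (n, d), w: (ksize, d). Output at step t uses only x[t-ksize+1 .. t].
    ksize = w.shape[0]
    padded = np.vstack([np.zeros((ksize - 1, x.shape[1])), x])
    return sum(w[j] * padded[j:j + x.shape[0]] for j in range(ksize))

def local_value(v, w):
    # Local-context path on V: causal convolution followed by SiLU.
    return silu(causal_depthwise_conv(v, w))

rng = np.random.default_rng(0)
# The same query/key vector placed at every position isolates the position effect.
q = np.tile(rng.normal(size=(1, 8)), (6, 1))
k = np.tile(rng.normal(size=(1, 8)), (6, 1))
rq, rk = rope(q), rope(k)
# The score depends only on the offset between positions (here, offset 1).
print(np.isclose(rq[1] @ rk[0], rq[5] @ rk[4]))
```

Because the rotation is a fixed function of the position index, it adds no parameters that grow with sequence length, in contrast to the learnable position encodings the abstract lists as a limitation of existing linear attention methods.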