Position-Aware Sequential Attention for Accurate Next Item Recommendations

Timur Nabiev; Evgeny Frolov

TL;DR

We introduce a kernelized self-attention mechanism in which a learnable positional kernel operates purely in the position space, disentangled from semantic similarity, and directly modulates attention weights. Applied per attention block, the kernel enables adaptive multi-scale sequential modeling.

Abstract

Sequential self-attention models usually rely on additive positional embeddings, which inject positional information into item representations at the input. In the absence of positional signals, the attention block is permutation-equivariant over sequence positions and thus has no intrinsic notion of temporal order beyond causal masking. We argue that additive positional embeddings make the attention mechanism only superficially sensitive to sequence order: positional information is entangled with item embedding semantics, propagates weakly in deep architectures, and limits the ability to capture rich sequential patterns. To address these limitations, we introduce a kernelized self-attention mechanism, where a learnable positional kernel operates purely in the position space, disentangled from semantic similarity, and directly modulates attention weights. When applied per attention block, this kernel enables adaptive multi-scale sequential modeling. Experiments on standard next-item prediction benchmarks show that our positional kernel attention consistently improves over strong competing baselines.
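
The permutation-equivariance claim above can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: it uses single-head attention with no causal mask, and the positional kernel is a fixed exponential-decay matrix standing in for the learnable kernel. The way the kernel enters (elementwise reweighting of the attention weights followed by row renormalization) is an assumption made for illustration only; the abstract does not specify the exact form. The names `self_attention` and `pos_kernel` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv, pos_kernel=None):
    """Single-head self-attention without a causal mask.

    pos_kernel is an optional (T, T) nonnegative matrix that depends only on
    positions. ASSUMPTION: it reweights attention elementwise and rows are
    renormalized; the paper's actual kernel may be applied differently.
    """
    d = Wq.shape[1]
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)
    A = softmax(scores, axis=-1)
    if pos_kernel is not None:
        A = A * pos_kernel
        A = A / A.sum(axis=-1, keepdims=True)
    return A @ (X @ Wv)

rng = np.random.default_rng(0)
T, d = 5, 4
X = rng.normal(size=(T, d))                      # item embeddings, no positional signal
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
perm = np.array([2, 0, 4, 1, 3])                 # a fixed non-identity reordering

# Without positional information, permuting the input permutes the output.
plain = self_attention(X, Wq, Wk, Wv)
plain_perm = self_attention(X[perm], Wq, Wk, Wv)
print("equivariant without kernel:", np.allclose(plain_perm, plain[perm]))

# Hypothetical position-only kernel: weight decays with distance |i - j|.
idx = np.arange(T)
pos_kernel = np.exp(-np.abs(idx[:, None] - idx[None, :]) / 2.0)

kern = self_attention(X, Wq, Wk, Wv, pos_kernel)
kern_perm = self_attention(X[perm], Wq, Wk, Wv, pos_kernel)
print("equivariant with kernel:", np.allclose(kern_perm, kern[perm]))
```

Under these assumptions, the first check holds by construction, since row-wise softmax commutes with a simultaneous permutation of rows and columns. The second check should fail, because the kernel is tied to absolute positions rather than to item content, so the output becomes order-sensitive without modifying the item embeddings.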

Paper Structure (31 sections, 12 equations, 3 figures, 4 tables)

Introduction
Background and problem formulation
Sequential self-attention
Analytical view of permutation-equivariance
Attention under sequence permutation
Partial position sensitivity with causal masking
Limitations of ad-hoc positional encoding
Gradient dilution across layers
Multi-scale sequential patterns
Position-content interference
Proposed approach: position-aware kernel
Targeting the position awareness
Learning with positional bilinear operators
Algebraic interpretation of the kernel
Implementation
...and 16 more sections

Figures (3)

Figure 1: Visualization of trained $\mathbf{U}$ (per layer) and $\mathbf{L}$ matrices on the ml-1m dataset.
Figure 2: Heatmap of attention matrices of models on the Y-listens dataset.
Figure 3: Training dynamics of models on the Yelp dataset.
