Empowering Sequential Recommendation from Collaborative Signals and Semantic Relatedness

Mingyue Cheng; Hao Zhang; Qi Liu; Fajie Yuan; Zhi Li; Zhenya Huang; Enhong Chen; Jun Zhou; Longfei Li

TL;DR

This work tackles the limitation of traditional sequential recommender systems that rely solely on collaborative signals by integrating semantic relatedness from content features. It introduces TSSR, a two-stream architecture that treats item IDs and content features as separate modalities, and employs a hierarchical contrasting module with losses $\mathcal{L}_u$ and $\mathcal{L}_i$ alongside an autoregressive objective $\mathcal{L}_{ce}$ to align and fuse modalities via cross-attention and a gating mechanism. Empirical results on five public datasets show that TSSR consistently outperforms strong baselines, with notable gains in visually driven domains, while demonstrating robustness to data sparsity. The work provides a practical, end-to-end framework and releases its code, highlighting the value of cross-modal alignment for enhancing sequential recommendations, albeit with higher training costs that may be addressed with parameter-efficient strategies in future work.
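As a rough illustration of the fusion and training objective described above, the sketch below blends an ID-stream vector and a content-stream vector with a sigmoid gate and adds the three loss terms together. The summary does not give the exact gate formulation or the loss weights, so the scalar gate, the names `gated_fusion` and `total_loss`, and the parameters `gate_weights`, `gate_bias`, `weight_u`, and `weight_i` are all assumptions made for illustration, not the authors' implementation.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(h_id, h_content, gate_weights, gate_bias=0.0):
    """Blend two same-length vectors with a scalar gate in (0, 1).

    Assumed form: g = sigmoid(w . [h_id; h_content] + b),
    fused = g * h_id + (1 - g) * h_content.
    """
    if len(h_id) != len(h_content):
        raise ValueError("both streams must have the same dimension")
    concat = list(h_id) + list(h_content)
    if len(gate_weights) != len(concat):
        raise ValueError("gate_weights must match the concatenated length")
    score = sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias
    g = sigmoid(score)
    return [g * a + (1.0 - g) * b for a, b in zip(h_id, h_content)]


def total_loss(l_ce, l_u, l_i, weight_u=1.0, weight_i=1.0):
    """Hypothetical weighted sum of the three objectives named in the summary."""
    return l_ce + weight_u * l_u + weight_i * l_i


if __name__ == "__main__":
    # With all-zero gate weights the gate is 0.5, so the streams are averaged.
    print(gated_fusion([1.0, 0.0], [0.0, 1.0], [0.0, 0.0, 0.0, 0.0]))
    print(total_loss(1.0, 0.5, 0.25, weight_u=0.1, weight_i=0.1))
```

A learned gate of this kind lets the model lean on the ID stream when collaborative evidence is strong and on the content stream otherwise; a per-dimension gate would be an equally plausible variant.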

Abstract

Sequential recommender systems (SRS) capture dynamic user preferences by modeling historical behaviors ordered in time. Despite their effectiveness, focusing only on the collaborative signals from behaviors does not fully capture user interests. It is also important to model the semantic relatedness reflected in content features, e.g., images and text. To that end, in this paper we aim to enhance SRS tasks by effectively unifying collaborative signals and semantic relatedness. Notably, we show empirically that achieving this goal is nontrivial because of semantic gap issues. We therefore propose an end-to-end two-stream architecture for sequential recommendation, named TSSR, which learns user preferences from ID-based and content-based sequences. Specifically, we first present a novel hierarchical contrasting module, comprising a coarse user-grained term and a fine item-grained term, to align representations across modalities. Furthermore, we design a two-stream architecture to learn the dependencies within each modality's sequence and the complex interactions between the modalities' sequences, which yields more expressive capacity for understanding user interests. We conduct extensive experiments on five public datasets. The experimental results show that TSSR outperforms competitive baselines. We also make our experimental code publicly available at https://github.com/Mingyue-Cheng/TSSR.
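The alignment idea behind the hierarchical contrasting module can be sketched with an InfoNCE-style loss, in which the ID-based and content-based representations of the same user (or the same item) form a positive pair and the other entries in the batch act as negatives. The abstract does not state the exact loss form, so the use of InfoNCE, cosine similarity, the function name `info_nce`, and the `temperature` value are assumptions for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(id_reprs, content_reprs, temperature=0.1):
    """Mean InfoNCE loss where row k of each list is a positive pair.

    id_reprs[k] is pulled toward content_reprs[k] and pushed away from
    every other row of content_reprs (in-batch negatives).
    """
    if len(id_reprs) != len(content_reprs):
        raise ValueError("both views must contain the same number of rows")
    losses = []
    for k, anchor in enumerate(id_reprs):
        logits = [cosine(anchor, c) / temperature for c in content_reprs]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(v - m) for v in logits))
        losses.append(log_denominator - logits[k])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    id_view = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print(info_nce(id_view, aligned))  # small: positives already match
    print(info_nce(id_view, swapped))  # large: positives are mismatched
```

Under this reading, a user-grained term would apply the function to sequence-level representations from the two streams, and an item-grained term would apply it to individual item embeddings.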

Paper Structure (28 sections, 4 equations, 6 figures, 3 tables)

Introduction
Preliminaries
Problem Statement
Empirical Studies
The Proposed Model
Model Architecture Overview
Feature Representation
Aligning Unimodal Representations
User-grained Contrasting
Item-grained Contrasting
Multimodal Interaction and Fusion
Model Optimization
Experiments
Experimental Setup
Datasets
...and 13 more sections

Figures (6)

Figure 1: Visualization of item representations extracted from SASRec-ID and SASRec-Contents. The same $1,000$ items are shown within each row of figures; the top row shows results on the Yelp dataset and the bottom row shows results on the Phone dataset.
Figure 2: Illustration of the TSSR model, a two-stream architecture for sequential recommendation in which collaborative signals and semantic relatedness are unified.
Figure 3: The effects of hierarchical contrasting on aligning embeddings across modalities. The left figure shows the recommendation results and the right figure reports representation quality measured by alignment and uniformity.
Figure 4: t-SNE visualization of sampled items in the H&M dataset. The left figure shows item representations without hierarchical contrasting, while the right figure shows the results of our full model.
Figure 5: The impact of batch size on the TSSR model. The left figure reports recommendation performance and the right figure shows representation quality measured by alignment and uniformity.
...and 1 more figure
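
The captions for Figures 3 and 5 refer to alignment and uniformity as representation quality metrics. The page does not define them, so the sketch below assumes the commonly used definitions on L2-normalized embeddings: alignment is the mean squared distance between positive pairs, and uniformity is the log of the mean Gaussian potential over all distinct pairs. The function names and the default `t=2.0` are assumptions, and lower values are better for both metrics under these definitions.

```python
import math
from itertools import combinations


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def alignment(xs, ys):
    """Mean squared distance between positive pairs (xs[k], ys[k])."""
    pairs = [(l2_normalize(x), l2_normalize(y)) for x, y in zip(xs, ys)]
    return sum(squared_distance(x, y) for x, y in pairs) / len(pairs)


def uniformity(points, t=2.0):
    """Log of the mean exp(-t * squared distance) over all distinct pairs."""
    normalized = [l2_normalize(p) for p in points]
    potentials = [
        math.exp(-t * squared_distance(a, b))
        for a, b in combinations(normalized, 2)
    ]
    return math.log(sum(potentials) / len(potentials))


if __name__ == "__main__":
    print(alignment([[1.0, 0.0]], [[1.0, 0.0]]))      # identical pair
    print(uniformity([[1.0, 0.0], [-1.0, 0.0]]))      # well spread
    print(uniformity([[1.0, 0.0], [1.0, 0.0]]))       # collapsed
```

Read this way, a lower alignment value means the two views of the same item sit close together, while a lower uniformity value means the embeddings are spread more evenly over the unit sphere.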
