Empowering Sequential Recommendation from Collaborative Signals and Semantic Relatedness
Mingyue Cheng, Hao Zhang, Qi Liu, Fajie Yuan, Zhi Li, Zhenya Huang, Enhong Chen, Jun Zhou, Longfei Li
TL;DR
This work tackles the limitation of traditional sequential recommender systems that rely solely on collaborative signals by integrating semantic relatedness from content features. It introduces TSSR, a two-stream architecture that treats item IDs and content features as separate modalities, and employs a hierarchical contrasting module with losses $\mathcal{L}_u$ and $\mathcal{L}_i$ alongside an autoregressive objective $\mathcal{L}_{ce}$ to align and fuse modalities via cross-attention and a gating mechanism. Empirical results on five public datasets show that TSSR consistently outperforms strong baselines, with notable gains in visually driven domains, while demonstrating robustness to data sparsity. The work provides a practical, end-to-end framework and releases its code, highlighting the value of cross-modal alignment for enhancing sequential recommendations, albeit with higher training costs that may be addressed with parameter-efficient strategies in future work.
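As a rough illustration of the fusion and training objective summarized above, the following minimal sketch shows one way a gate could blend an ID-stream vector with a content-stream vector, and how the three losses might be summed. It is not the released TSSR implementation: the element-wise (diagonal) gate, the function names `gated_fusion` and `total_loss`, and the weights `lambda_u` and `lambda_i` are all assumptions made for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function used to squash gate logits into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(h_id, h_content, w_id, w_content, bias):
    """Blend two equal-length vectors with an element-wise sigmoid gate.

    Assumption: a diagonal gate (one weight per dimension and stream),
    chosen only to keep the sketch dependency-free.
    """
    fused = []
    for a, b, wa, wb, c in zip(h_id, h_content, w_id, w_content, bias):
        gate = sigmoid(wa * a + wb * b + c)
        # gate near 1 favors the ID stream, near 0 the content stream
        fused.append(gate * a + (1.0 - gate) * b)
    return fused


def total_loss(l_ce, l_u, l_i, lambda_u=0.1, lambda_i=0.1):
    """Combine the autoregressive loss with the two contrastive terms.

    Assumption: a weighted sum with hypothetical weights lambda_u, lambda_i.
    """
    return l_ce + lambda_u * l_u + lambda_i * l_i
```

With all gate parameters set to zero the gate is 0.5 everywhere, so the fused vector is simply the average of the two streams; learned parameters would let the model shift that balance per dimension.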
Abstract
Sequential recommender systems (SRS) capture dynamic user preferences by modeling historical behaviors ordered in time. Despite their effectiveness, focusing only on the \textit{collaborative signals} from behaviors does not fully capture user interests. It is also important to model the \textit{semantic relatedness} reflected in content features, e.g., images and text. To that end, in this paper we aim to enhance SRS by effectively unifying collaborative signals and semantic relatedness. Notably, we show empirically that achieving this goal is nontrivial because of the semantic gap. Thus, we propose an end-to-end two-stream architecture for sequential recommendation, named TSSR, to learn user preferences from ID-based and content-based sequences. Specifically, we first present a novel hierarchical contrasting module, including coarse user-grained and fine item-grained terms, to align the representations across modalities. Furthermore, we design a two-stream architecture to learn the dependencies within each modality's sequence and the complex interactions between the modalities' sequences, which yields greater expressive capacity for understanding user interests. We conduct extensive experiments on five public datasets. The experimental results show that TSSR outperforms competitive baselines. We also make our experimental code publicly available at https://github.com/Mingyue-Cheng/TSSR.
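To make the idea of aligning representations across modalities more concrete, here is a minimal sketch of an InfoNCE-style contrastive loss between ID-stream and content-stream vectors. It assumes in-batch negatives, cosine similarity, and a hypothetical `temperature` value; the exact form of the user-grained and item-grained terms in TSSR may differ, so treat it as an illustration rather than the paper's formulation.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a non-zero vector to unit length so dot() gives cosine similarity."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(id_views, content_views, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    Row k of id_views is the anchor, row k of content_views is its positive,
    and every other row of content_views serves as an in-batch negative.
    """
    anchors = [normalize(v) for v in id_views]
    targets = [normalize(v) for v in content_views]
    loss = 0.0
    for k, anchor in enumerate(anchors):
        logits = [dot(anchor, t) / temperature for t in targets]
        # log-sum-exp with max subtraction for numerical stability
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        loss += log_denominator - logits[k]
    return loss / len(anchors)
```

Under these assumptions the same function could serve both granularities: passing one pooled vector per user gives a coarse user-level term, while passing one vector per item gives a fine item-level term.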
