Image Fusion for Cross-Domain Sequential Recommendation

Wangyu Wu, Siqi Song, Xianglin Qiu, Xiaowei Huang, Fei Ma, Jimin Xiao

TL;DR

This work tackles Cross-Domain Sequential Recommendation (CDSR) by addressing domain bias and the underutilization of visual item representations. It introduces IFCDSR, which fuses image embeddings from a frozen CLIP encoder with learnable item ID embeddings and processes three sequences (domain $X$, domain $Y$, and the combined $X+Y$ sequence) through a multi-attention architecture to capture both intra-domain and cross-domain user preferences. The approach jointly learns single-domain and cross-domain interests, and predictions are made by combining ID-based and image-based similarities into a fused probability. Experiments on re-partitioned Amazon CDSR datasets show IFCDSR achieving state-of-the-art results, demonstrating the practical benefit of incorporating visual signals into cross-domain recommendation systems.

Abstract

Cross-Domain Sequential Recommendation (CDSR) aims to predict future user interactions based on historical interactions across multiple domains. The key challenge in CDSR is effectively capturing cross-domain user preferences by fully leveraging both intra-sequence and inter-sequence item interactions. In this paper, we propose a novel method, Image Fusion for Cross-Domain Sequential Recommendation (IFCDSR), which incorporates item image information to better capture visual preferences. Our approach integrates a frozen CLIP model to generate image embeddings, enriching original item embeddings with visual data from both intra-sequence and inter-sequence interactions. Additionally, we employ multiple attention layers to capture cross-domain interests, enabling joint learning of single-domain and cross-domain user preferences. To validate the effectiveness of IFCDSR, we re-partitioned four e-commerce datasets and conducted extensive experiments. Results demonstrate that IFCDSR significantly outperforms existing methods.
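
The abstract does not say how the intra-sequence and inter-sequence inputs are built. The Figure 2 caption names three sequences, $S^X$, $S^Y$, and $S^{X+Y}$, so the following is a minimal sketch of one plausible construction. It assumes each interaction is stored as a `(domain, item_id)` pair in chronological order. The function name `split_sequences` and the data layout are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# Assumed representation: one (domain, item_id) pair per interaction,
# ordered by time. This layout is illustrative, not the paper's format.
Interaction = Tuple[str, int]


def split_sequences(history: List[Interaction]) -> Dict[str, list]:
    """Derive the two single-domain sequences and the combined sequence."""
    s_x = [item for domain, item in history if domain == "X"]
    s_y = [item for domain, item in history if domain == "Y"]
    # The combined sequence keeps the original order and the domain tags,
    # so cross-domain transitions remain visible to a downstream model.
    s_xy = list(history)
    return {"X": s_x, "Y": s_y, "X+Y": s_xy}


history = [("X", 3), ("Y", 7), ("X", 5), ("Y", 2)]
seqs = split_sequences(history)
```

Here `seqs["X"]` and `seqs["Y"]` carry the intra-domain order only, while `seqs["X+Y"]` preserves the interleaving that cross-domain attention would need.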

Paper Structure

This paper contains 15 sections, 9 equations, 2 figures, and 4 tables.

Figures (2)

  • Figure 1: (a) In the traditional CDSR framework, the input consists solely of item ID features. (b) In our IFCDSR framework, we incorporate additional image information to complement the existing item features.
  • Figure 2: The overview of our proposed IFCDSR. Items from domains $X$ and $Y$ are embedded using a learnable ID-based matrix $E_{id}$ and a frozen CLIP image encoder $E_{img}$. The input sequence $\mathcal{S}$, comprising $S^X$, $S^Y$, and $S^{X+Y}$, is embedded for item ID and image features, then processed through multiple attention layers to capture intra- and inter-sequence relationships. Attention-aggregated embeddings are used with cosine similarity against $E_{id}$ and $E_{img}$ to predict the next item.
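
The Figure 2 caption describes scoring the attention-aggregated embeddings against $E_{id}$ and $E_{img}$ with cosine similarity, and the TL;DR mentions a fused probability. The exact fusion rule is not given in this excerpt, so the sketch below makes two assumptions: each similarity vector is normalized with a softmax, and the two distributions are mixed with a weight `alpha`. The names `fused_next_item_probs` and `alpha` are hypothetical, and the tiny two-dimensional embeddings stand in for real ID and CLIP features.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fused_next_item_probs(
    h_id: Sequence[float],
    h_img: Sequence[float],
    e_id: Sequence[Sequence[float]],
    e_img: Sequence[Sequence[float]],
    alpha: float = 0.5,  # assumed mixing weight, not specified in the source
) -> List[float]:
    """Mix ID-based and image-based next-item distributions."""
    p_id = softmax([cosine(h_id, row) for row in e_id])
    p_img = softmax([cosine(h_img, row) for row in e_img])
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(p_id, p_img)]


# Toy catalogue of three items with 2-d ID and image embeddings.
e_id = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
e_img = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs = fused_next_item_probs([1.0, 0.0], [0.9, 0.1], e_id, e_img)
best_item = probs.index(max(probs))
```

Because the result is a convex combination of two probability distributions, it sums to one for any `alpha` in [0, 1]. A learned or tuned weight could replace the fixed default.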