LLM-Enhanced Multimodal Fusion for Cross-Domain Sequential Recommendation

Wangyu Wu, Zhenhong Chen, Wenqiao Zhang, Xianglin Qiu, Siqi Song, Xiaowei Huang, Fei Ma, Jimin Xiao

TL;DR

LLM-Enhanced Multimodal Fusion for Cross-Domain Sequential Recommendation (LLM-EMF) enhances item text with Large Language Model (LLM) knowledge and fuses visual and textual data to improve recommendation performance.

Abstract

Cross-Domain Sequential Recommendation (CDSR) predicts user behavior by leveraging historical interactions across multiple domains, focusing on modeling cross-domain preferences and capturing both intra- and inter-sequence item relationships. We propose LLM-Enhanced Multimodal Fusion for Cross-Domain Sequential Recommendation (LLM-EMF), an approach that enhances textual information with Large Language Model (LLM) knowledge and improves recommendation performance by fusing visual and textual data. Using the frozen CLIP model, we generate image and text embeddings that enrich item representations with multimodal data. A multiple attention mechanism jointly learns single-domain and cross-domain preferences, capturing complex user interests across diverse domains. Evaluations on four e-commerce datasets show that LLM-EMF consistently outperforms existing methods in modeling cross-domain user preferences, highlighting the value of multimodal data integration for sequential recommendation. Our source code will be released.

Paper Structure

This paper contains 16 sections, 12 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: (a) Traditional CDSR uses only item ID features. (b) Our LLM-EMF incorporates image and title information, enriching item representations.
  • Figure 2: Overview of the proposed LLM-EMF framework. The Feature Preparation module generates ID-, image-, and text-based embeddings from domains $X$ and $Y$ using a learnable ID matrix, a frozen CLIP image encoder, and a text encoder. These embeddings are processed through multi-layer attention to model intra- and inter-sequence relationships, and cosine similarity with the embedding matrices is used for next-item prediction.
  • Figure 3: The prompt-and-generate pipeline of our method. In the example shown, a movie item is enhanced by prompting the LLM for additional contextual information. The pipeline first generates prompts and feeds them to the LLM. The LLM output contains the enhanced information: key terms, a summary of the item, and potential user interests. This enhanced information is then passed to the text feature embedding step.
  • Figure 4: The process of transforming a user sequence into a representation sequence. Initially, the user’s sequence of interactions is converted into an embedded representation. This sequence is then passed through an attention layer to capture both intra- and inter-sequence relationships. Finally, the attention-aggregated sequence representations are compared with item embeddings, and the item with the highest similarity score is selected as the predicted next item.
  • Figure 5: Impact of Hyperparameters on Final Performance on Movie Data
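
The prediction step described in the captions of Figures 2 and 4 (embed the interaction sequence, aggregate it with attention, then pick the item with the highest cosine similarity) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the attention here is a single scaled dot-product pass with the last item as the query and no learned projections, the 2-d vectors in `ITEMS` are made-up stand-ins for the ID-, image-, and text-based embeddings, and the names `attend_last` and `predict_next` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def attend_last(seq):
    """Scaled dot-product attention with the last position as the query.

    Returns a softmax-weighted sum of the sequence embeddings. There are no
    learned projections: this is a simplification, not the paper's layer.
    """
    query, dim = seq[-1], len(seq[-1])
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in seq]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * vec[d] for w, vec in zip(weights, seq)) / total for d in range(dim)]


def predict_next(history, item_table):
    """Return the ID of the item most cosine-similar to the attended history."""
    user_vec = attend_last([item_table[item_id] for item_id in history])
    return max(item_table, key=lambda item_id: cosine(user_vec, item_table[item_id]))


# Toy 2-d item embeddings (assumed values, for illustration only).
ITEMS = {
    "movie_a": [1.0, 0.0],
    "movie_b": [0.9, 0.1],
    "book_c": [0.0, 1.0],
}

if __name__ == "__main__":
    # A history that mixes two domains: one book, then one movie.
    print(predict_next(["book_c", "movie_a"], ITEMS))
```

Because the history mixes a book and a movie, the attended vector sits between the two domains, and the ranking reflects both. A real system would use learned attention layers and would typically score against much larger embedding matrices.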