DMESR: Dual-view MLLM-based Enhancing Framework for Multimodal Sequential Recommendation

Mingyao Huang, Qidong Liu, Wenxuan Yang, Moranxin Wang, Yuqi Sun, Haiping Zhu, Feng Tian, Yan Chen

TL;DR

Sequential recommender systems suffer from data sparsity. Multimodal large language models (MLLMs) offer rich semantic cues, but existing approaches struggle with cross-modal misalignment and the loss of fine-grained textual semantics. DMESR is a dual-view framework with two modules: a contrastive alignment module for the cross-modal representations generated by a three-way prompting framework, and a bidirectional cross-attention fusion module that blends coarse MLLM semantics with fine-grained item texts. The fused representations can be plugged into downstream SRS backbones, enabling flexible integration across models. Empirical results on three real-world datasets and three popular architectures show improved performance and strong generalizability, supporting the practicality of leveraging MLLMs for multimodal sequential recommendation.

Abstract

Sequential Recommender Systems (SRS) aim to predict a user's next interaction from their historical behaviors, but they still face the challenge of data sparsity. With the rapid advancement of Multimodal Large Language Models (MLLMs), leveraging their multimodal understanding capabilities to enrich item semantic representations has emerged as an effective enhancement strategy for SRS. However, existing MLLM-enhanced recommendation methods still suffer from two key limitations. First, they struggle to align multimodal representations effectively, leading to suboptimal use of semantic information across modalities. Second, they rely too heavily on MLLM-generated content and overlook the fine-grained semantic cues contained in the original textual data of items. To address these issues, we propose a Dual-view MLLM-based Enhancing framework for multimodal Sequential Recommendation (DMESR). For the misalignment issue, we employ a contrastive learning mechanism to align the cross-modal semantic representations generated by MLLMs. For the loss of fine-grained semantics, we introduce a cross-attention fusion module that integrates the coarse-grained semantic knowledge obtained from MLLMs with the fine-grained original textual semantics. Finally, the two fused representations can be seamlessly integrated into downstream sequential recommendation models. Extensive experiments conducted on three real-world datasets and three popular sequential recommendation architectures demonstrate the superior effectiveness and generalizability of our proposed approach.
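
The two mechanisms named in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The abstract does not specify the loss form, temperature, or attention parameterization, so the symmetric InfoNCE loss, the `temperature` value, the projection-free attention, and every function name below are assumptions made for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def mean_diag_nll(logits):
    """Mean negative log-softmax of the diagonal (matching-pair) entries."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row max for numerical stability
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        total += lse - row[i]
    return total / len(logits)


def info_nce(text_emb, image_emb, temperature=0.1):
    """Symmetric InfoNCE: item i's text and image embeddings are positives,
    all other items in the batch are negatives (assumed loss form)."""
    t = [l2_normalize(v) for v in text_emb]
    g = [l2_normalize(v) for v in image_emb]
    logits = [[dot(a, b) / temperature for b in g] for a in t]
    logits_t = [list(col) for col in zip(*logits)]  # image-to-text direction
    return 0.5 * (mean_diag_nll(logits) + mean_diag_nll(logits_t))


def cross_attend(queries, keys_values):
    """Scaled dot-product attention without learned projections:
    each query becomes a softmax-weighted average of keys_values."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d) for k in keys_values]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([
            sum(w / z * kv[j] for w, kv in zip(weights, keys_values))
            for j in range(d)
        ])
    return out


def bidirectional_fuse(coarse, fine):
    """Each view attends to the other; returns both attended views."""
    return cross_attend(coarse, fine), cross_attend(fine, coarse)
```

In this toy setting, `info_nce` is small when each item's text and image embeddings point in the same direction and large when they are mismatched, which is the behavior a contrastive alignment objective relies on. `bidirectional_fuse` returns one attended representation per view; how DMESR combines these with the original embeddings and passes them to the SRS backbone is not stated in the abstract.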
