Multimodal Pre-training Framework for Sequential Recommendation via Contrastive Learning

Lingzi Zhang; Xin Zhou; Zhiwei Zeng; Zhiqi Shen

TL;DR

MP4SR addresses the challenge of leveraging multimodal data in sequential recommendation by introducing a multimodal pre-training framework that uses contrastive learning to align the modality sequences of users and items. It combines multimodal feature extraction, a backbone network called the Multimodal Mixup Sequence Encoder (M2SE), and two pre-training objectives (modality-specific next-item prediction and cross-modality contrastive learning), with a weight $\lambda$ balancing the two losses. Empirical results on four real-world datasets show that MP4SR consistently outperforms state-of-the-art baselines in both normal and cold-start scenarios, and ablations confirm the value of each component. The study also finds that multimodal pre-training acts as a regularizer that improves optimization and generalization, and it provides a basis for further work on multimodal signals in sequential recommendation.
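
One way to read the role of $\lambda$ is as a weight on the contrastive term in a summed pre-training objective. The form below is an assumption for illustration: this summary does not state which term $\lambda$ scales, and the symbols $\mathcal{L}_{\text{NIP}}$ (modality-specific next-item prediction) and $\mathcal{L}_{\text{CL}}$ (cross-modality contrastive learning) are labels introduced here, not the paper's notation.

$$
\mathcal{L}_{\text{pre-train}} = \mathcal{L}_{\text{NIP}} + \lambda \, \mathcal{L}_{\text{CL}}
$$

Under this reading, $\lambda = 0$ would reduce pre-training to next-item prediction alone, and larger values would put more emphasis on aligning the modality sequences.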

Abstract

Current multimodal sequential recommendation models are often unable to effectively explore and capture correlations among behavior sequences of users and items across different modalities, either neglecting correlations among sequence representations or inadequately capturing associations between multimodal data and sequence data in their representations. To address this problem, we explore multimodal pre-training in the context of sequential recommendation, with the aim of enhancing fusion and utilization of multimodal information. We propose a novel Multimodal Pre-training for Sequential Recommendation (MP4SR) framework, which utilizes contrastive losses to capture the correlation among different modality sequences of users, as well as the correlation among different modality sequences of users and items. MP4SR consists of three key components: 1) multimodal feature extraction, 2) a backbone network, Multimodal Mixup Sequence Encoder (M2SE), and 3) pre-training tasks. After utilizing pre-trained encoders to generate initial multimodal features of items, M2SE adopts a complementary sequence mixup strategy to fuse different modality sequences, and leverages contrastive learning to capture modality interactions at the sequence-to-sequence and sequence-to-item levels. Extensive experiments on four real-world datasets demonstrate that MP4SR outperforms state-of-the-art approaches in both normal and cold-start settings. We further highlight the efficacy of incorporating multimodal pre-training in sequential recommendation representation learning, serving as an effective regularizer and optimizing the parameter space for the recommendation task.
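
The two mechanisms named in the abstract, complementary sequence mixup and a contrastive loss between modality sequences, can be illustrated with a minimal NumPy sketch. Everything here is an assumption for illustration rather than the authors' implementation: the function names `complementary_mixup` and `info_nce`, the per-position swap rule, the `swap_prob` and `temperature` parameters, and the use of a symmetric InfoNCE loss with in-batch negatives are all hypothetical choices.

```python
import numpy as np


def complementary_mixup(text_seq, image_seq, swap_prob, rng):
    """Swap embeddings between two modality sequences at random positions.

    text_seq, image_seq: arrays of shape (seq_len, dim).
    Returns two mixed sequences that are complementary: at every position,
    one output holds the text embedding and the other holds the image one.
    (Assumed mixing rule; the abstract does not specify it.)
    """
    swap = rng.random(text_seq.shape[0]) < swap_prob  # (seq_len,) boolean mask
    mixed_text = np.where(swap[:, None], image_seq, text_seq)
    mixed_image = np.where(swap[:, None], text_seq, image_seq)
    return mixed_text, mixed_image


def info_nce(anchor, positive, temperature=0.1):
    """Symmetric InfoNCE loss with in-batch negatives.

    anchor, positive: arrays of shape (batch, dim). Row i of each array forms
    a positive pair; all other rows in the batch act as negatives.
    """
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    p = positive / np.linalg.norm(positive, axis=1, keepdims=True)
    logits = a @ p.T / temperature  # (batch, batch) scaled cosine similarities

    def cross_entropy_diag(scores):
        scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the anchor-to-positive and positive-to-anchor directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))


rng = np.random.default_rng(0)
text_seq = rng.normal(size=(5, 8))   # toy text-modality item embeddings
image_seq = rng.normal(size=(5, 8))  # toy image-modality item embeddings
mixed_text, mixed_image = complementary_mixup(text_seq, image_seq, 0.5, rng)

# Toy sequence-level representations for a batch of 4 users.
text_repr = np.eye(4)
aligned_loss = info_nce(text_repr, np.eye(4))
shuffled_loss = info_nce(text_repr, np.roll(np.eye(4), 1, axis=0))
```

In this sketch, pairing each user's text-sequence representation with the same user's image-sequence representation corresponds to the sequence-to-sequence level; the same `info_nce` function could be applied at the sequence-to-item level by pairing a sequence representation with the embedding of its target item.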

Paper Structure (38 sections, 13 equations, 8 figures, 6 tables)

Introduction
Related Work
Sequential Recommendation
Multimodal Recommendation
Multimodal Pre-training
Methodology
Notations
Multimodal Feature Extraction
Text Feature Extraction
Image Feature Extraction
Multimodal Mixup Sequence Encoder
Sequence Random Dropout
Text and Image Encoders
Complementary Sequence Mixup
Transformer Layers
...and 23 more sections

Figures (8)

Figure 1: Overall framework of the proposed method MP4SR, which consists of three main components: (a) The multimodal feature extraction module used to obtain initial multimodal features of items. (b) The structure of the proposed multimodal mixup sequence encoder that fuses items' multimodal content with users' behavior sequence. (c) The workflow of the proposed pre-training framework, where $\mathcal{S}$ is the input sequence and $i_{n+1}$ is the target item.
Figure 2: Two examples of converting images of an item into text tokens. Items are retrieved from the Amazon Pantry and Arts datasets. Text tokens are generated using CLIP (Radford et al., 2021).
Figure 3: Log of the test loss plotted against the log of the training loss as training proceeds, without pre-training (blue) and with pre-training (orange), on the Pantry and Office datasets. Each group has 5 curves, one per random initialization. During training, the trajectories move from right (high training error) to left (low training error) as the training error decreases.
Figure 4: The performance trends of MP4SR with respect to different settings of $\lambda$ on Pantry and Office datasets.
Figure 5: The performance trends of MP4SR with respect to different settings of $N$ on Pantry and Office datasets.
...and 3 more figures
