MLLMRec-R1: Incentivizing Reasoning Capability in Large Language Models for Multimodal Sequential Recommendation

Yu Wang; Yonghui Yang; Le Wu; Jiancan Wu; Hefei Xu; Hui Lin

TL;DR

MLLMRec-R1 is an efficient and stable GRPO-based reasoning framework for multimodal sequential recommendation. It textualizes visual signals offline to eliminate expensive visual tokens while preserving multimodal semantics, and constructs high-quality multimodal CoT supervision through refinement and confidence-aware assessment.

Abstract

Group relative policy optimization (GRPO) has become a standard post-training paradigm for improving reasoning and preference alignment in large language models (LLMs), and has recently shown strong effectiveness in LLM-based recommender systems. However, extending GRPO-based reasoning pipelines to multimodal sequential recommendation (MSR) with multimodal large language models (MLLMs) faces fundamental obstacles. First, MSR requires jointly encoding visual content for both historical interactions and multiple candidate items, causing visual tokens to dominate the input and making the cost of group-based rollout scale with history length and candidate set size, which renders GRPO-based training prohibitively expensive. Second, existing Chain-of-Thought (CoT) supervision suffers from reward inflation in recommendation scenarios, where higher training rewards do not reliably translate into improved ranking performance and may induce shortcut learning. To address these challenges, we propose MLLMRec-R1, an efficient and stable GRPO-based reasoning framework for multimodal sequential recommendation. MLLMRec-R1 textualizes visual signals offline to eliminate expensive visual tokens while preserving multimodal semantics, and constructs high-quality multimodal CoT supervision through refinement and confidence-aware assessment. Furthermore, a mixed-grained data augmentation strategy selectively injects reliable CoT samples while retaining standard training data, mitigating reward inflation and improving generalization stability. Extensive experiments on three benchmark datasets demonstrate that MLLMRec-R1 consistently outperforms state-of-the-art methods, establishing a practical and effective GRPO-based reasoning pipeline for multimodal sequential recommendation. The code is available at https://github.com/wangyu0627/MLLMRec-R1.
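
As a rough illustration of the group-based rollout the abstract refers to, the sketch below shows the group-relative advantage computation commonly used in GRPO: several completions are sampled for the same context, and each reward is standardized against its own group. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for this example, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one rollout group (same prompt).

    `rewards` holds one scalar reward per sampled completion.
    `eps` (an assumed constant) avoids division by zero when
    every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ here
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because every completion in a group must be generated and scored, the per-step cost grows with both the group size and the prompt length, which is why long visual-token prompts make this style of training expensive.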

Paper Structure (37 sections, 8 equations, 10 figures, 5 tables, 1 algorithm)

Introduction
Preliminary
Problem Formulation
Supervised Fine-Tuning (SFT)
Group Relative Policy Optimization (GRPO)
Methodology
Multimodal Chain-of-Thought (CoT)
Mixed-grained Data Augmentation
Lightweight Reward Rules
Model Optimization with GRPO
Experiments
Experiment Setup
Datasets
Baselines
Metrics
...and 22 more sections

Figures (10)

Figure 1: (a) Visual tokens make compute grow with history length and candidate set size, yet often yield limited gains over LLMs, revealing an efficiency bottleneck. (b) Some CoT data may boost training reward scores but hurt test performance, revealing shortcut learning and poor generalization.
Figure 2: Multimodal CoT construction pipeline. (1) Caption generation. We use an MLLM to produce fine-grained captions for each item image, compressing costly visual signals into reusable text offline. (2) Pseudo-CoT construction. Given the interaction history, we prompt the MLLM to generate structured CoT rationales that surface key visual cues in text beyond plain captions. (3) CoT refinement. We combine these visual cues with the reasoning trace and input them to DeepSeek-R1 to obtain high-quality multimodal CoT supervision.
Figure 3: We compute modality consistency between title–image pairs and prediction consistency between the refined CoT profile and the target item, then mix filtered CoT refinement into the instruction data.
Figure 4: GRPO performs group sampling under the same context and applies reward rules with format check and hit check to provide stable training signals.
Figure 5: Performance comparison of different backbone sizes on all datasets w.r.t. HR@3 and NDCG@3.
...and 5 more figures
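
The Figure 4 caption mentions reward rules built from a format check and a hit check. A minimal sketch of what such a rule could look like follows. The `<think>`/`<answer>` template, the score values, and the case-insensitive title match are all hypothetical choices for illustration; the paper's actual rules may differ.

```python
import re

# Hypothetical output template: a reasoning block followed by one answer.
ANSWER_PATTERN = re.compile(
    r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL
)


def rule_reward(completion, target_title, format_score=0.5, hit_score=1.0):
    """Score one completion with a format check and a hit check.

    format_score and hit_score are assumed values, not the paper's.
    """
    match = ANSWER_PATTERN.fullmatch(completion.strip())
    if match is None:
        return 0.0  # format check failed: no reward at all
    reward = format_score
    predicted = match.group(1).strip().lower()
    if predicted == target_title.strip().lower():
        reward += hit_score  # hit check: answer equals the target item
    return reward
```

For example, `rule_reward("<think>likes mugs</think>\n<answer>Blue Mug</answer>", "blue mug")` passes both checks under these assumptions, while a completion without the expected tags scores zero.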
