VRAgent-R1: Boosting Video Recommendation with MLLM-based Agents via Reinforcement Learning

Siran Chen, Boyu Chen, Chenyun Yu, Yuxiao Luo, Ouyang Yi, Lei Cheng, Chengxiang Zhuo, Zang Li, Yali Wang

TL;DR

VRAgent-R1 introduces a two-agent framework, consisting of an Item Perception (IP) Agent and a User Simulation (US) Agent, that uses multimodal large language models (MLLMs) and reinforcement fine-tuning to improve video recommendation. The IP Agent builds multimodal item representations through progressive analysis of video frames and titles, while the US Agent simulates user decisions via chain-of-thought reasoning and GRPO-based training. On MicroLens-100k, the IP Agent improves NDCG@10 by 6.0% and yields gains for cold-start users, and the US Agent reaches approximately 45.0% higher accuracy in user decision simulation than state-of-the-art baselines, with its simulation accuracy also generalizing to MovieLens-1M. The work highlights the potential of combining interpretable LLM-based reasoning with RL for more accurate and human-like recommendations, with scope for extension to other domains and richer user behaviors.
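
As a rough illustration of the "GRPO-based training" mentioned above, the sketch below shows the group-relative advantage step that GRPO is generally known for: several responses are sampled for the same prompt, each is scored, and each reward is normalized against its own group. The reward rule here (1.0 when the simulated decision matches the logged user behavior, 0.0 otherwise) and the names `decision_reward` and `group_relative_advantages` are assumptions made for illustration; this page does not specify the paper's actual reward design.

```python
import statistics


def decision_reward(simulated_choice: str, actual_choice: str) -> float:
    """Hypothetical reward: 1.0 if the simulated decision matches the log."""
    return 1.0 if simulated_choice == actual_choice else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward by the mean and std of its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    # eps avoids division by zero when every sample gets the same reward
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled decisions for one user prompt; the logged choice is "video_a".
samples = ["video_a", "video_b", "video_b", "video_a"]
rewards = [decision_reward(s, "video_a") for s in samples]
advantages = group_relative_advantages(rewards)
```

Samples that beat the group average receive a positive advantage and the others a negative one, so no separate value network is needed to provide a baseline. When all samples in a group earn the same reward, every advantage is zero and that group contributes no learning signal.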

Abstract

Owing to powerful natural language processing and generative capabilities, large language model (LLM) agents have emerged as a promising solution for enhancing recommendation systems via user simulation. However, in the realm of video recommendation, existing studies predominantly resort to prompt-based simulation using frozen LLMs and encounter the intricate challenge of multimodal content understanding. This frequently results in suboptimal item modeling and user preference learning, thereby ultimately constraining recommendation performance. To address these challenges, we introduce VRAgent-R1, a novel agent-based paradigm that incorporates human-like intelligence in user simulation. Specifically, VRAgent-R1 comprises two distinct agents: the Item Perception (IP) Agent and the User Simulation (US) Agent, designed for interactive user-item modeling. Firstly, the IP Agent emulates human-like progressive thinking based on MLLMs, effectively capturing hidden recommendation semantics in videos. With a more comprehensive multimodal content understanding provided by the IP Agent, the video recommendation system is equipped to provide higher-quality candidate items. Subsequently, the US Agent refines the recommended video sets based on in-depth chain-of-thought (CoT) reasoning and achieves better alignment with real user preferences through reinforcement learning. Experimental results on a large-scale video recommendation benchmark have demonstrated the effectiveness of our proposed VRAgent-R1 method, e.g., the IP Agent achieves a 6.0% improvement in NDCG@10 on the MicroLens-100k dataset, while the US Agent shows approximately 45.0% higher accuracy in user decision simulation compared to state-of-the-art baselines.
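
For readers unfamiliar with the NDCG@10 metric cited above, the following is a minimal sketch of how it is commonly computed with binary relevance (an item is either one the user interacted with or not). The function names `dcg_at_k` and `ndcg_at_k` and the toy item IDs are illustrative assumptions; the paper's exact evaluation protocol is not described on this page.

```python
import math


def dcg_at_k(gains: list[float], k: int) -> float:
    """Discounted cumulative gain over the top-k positions (rank 0 is first)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_items: list[str], relevant_items: set[str], k: int = 10) -> float:
    """NDCG@k with binary relevance; returns 0.0 if nothing is relevant."""
    gains = [1.0 if item in relevant_items else 0.0 for item in ranked_items]
    # Ideal ranking places every relevant item at the top of the list.
    ideal = [1.0] * min(len(relevant_items), k)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy example: the single held-out item appears at rank 2 of the list.
score = ndcg_at_k(["v7", "v3", "v9"], {"v3"})
```

In this toy case the score is 1 / log2(3), roughly 0.63, because the one relevant item sits at the second position rather than the first. A reported improvement in NDCG@10 therefore means relevant videos are, on average, being ranked closer to the top of the 10-item list.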

Paper Structure

This paper contains 20 sections, 9 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Qualitative examples of VRAgent-R1 for Video Recommendation. Our VRAgent-R1 stands out from previous supervised fine-tuning of MLLMs, which fails to give the correct prediction because it lacks understanding of video items and deep thinking about user status.
  • Figure 2: Overview of our VRAgent-R1 framework. We propose a framework with two novel agents for better video recommendation. The IP Agent conducts collaborative multimodal understanding to obtain enhanced video features for the recommendation system and the US Agent. Meanwhile, the US Agent simulates user behavior via deep CoT reasoning based on user status. By reinforcement learning with actual behavior rewards, VRAgent-R1 achieves superior simulation performance and helps improve the recommendation accuracy.
  • Figure 3: Video understanding by the IP Agent. We simulate the human video comprehension process through a progressive approach involving retrieval, collaborative perception, and analysis, so as to obtain a summary of the key video information that is applicable for recommendation.
  • Figure 4: User Group Performance. Our method has a clear advantage in modeling cold-start users thanks to its multimodal understanding, outperforming the baseline by more than 10%.
  • Figure 5: Comparison with other user simulation methods for recommendation.
  • ...and 1 more figure