Multimodal Large Language Models with Adaptive Preference Optimization for Sequential Recommendation

Yu Wang, Yonghui Yang, Le Wu, Yi Zhang, Richang Hong

TL;DR

HaNoRec is a Multimodal LLM framework that integrates Hardness-aware and Noise-regularized preference optimization for Recommendation. It dynamically adjusts optimization weights based on both the estimated hardness of each training sample and the policy model's real-time responsiveness, prioritizing harder examples.

Abstract

Recent advances in Large Language Models (LLMs) have opened new avenues for sequential recommendation by enabling natural language reasoning over user behavior sequences. A common approach formulates recommendation as a language modeling task, where interaction histories are transformed into prompts and user preferences are learned via supervised fine-tuning. However, these methods operate solely in the textual modality and often miss users' fine-grained interests, especially when shaped by rich visual signals such as product images or movie posters. Multimodal Large Language Models (MLLMs) offer a promising alternative by aligning text and vision in a shared semantic space. A prevalent training paradigm applies Supervised Fine-Tuning (SFT) followed by Direct Preference Optimization (DPO) to model user preferences. Yet, two core challenges remain: 1) Imbalanced sample hardness, where random negative sampling causes overfitting on easy examples and under-training on hard ones; 2) Cross-modal semantic bias, where the fixed reference model in DPO prevents the policy model from correcting modality misalignments--especially over long sequences. To address these issues, we propose a Multimodal LLM framework that integrates Hardness-aware and Noise-regularized preference optimization for Recommendation (HaNoRec). Specifically, HaNoRec dynamically adjusts optimization weights based on both the estimated hardness of each training sample and the policy model's real-time responsiveness, prioritizing harder examples. It further introduces Gaussian-perturbed distribution optimization on output logits to enhance cross-modal semantic consistency and reduce modality bias inherited from the reference model.
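
The two ideas in the abstract can be illustrated with a small sketch. The abstract does not give HaNoRec's formulas, so the code below is not the authors' method. It starts from the standard DPO loss, $-\log \sigma(\beta[(\log \pi_\theta(y_w) - \log \pi_{\text{ref}}(y_w)) - (\log \pi_\theta(y_l) - \log \pi_{\text{ref}}(y_l))])$, and adds two assumed ingredients: a hypothetical per-sample weight `hardness_weight` that grows when the policy separates a pair poorly, and a hypothetical `noisy_kl` term that perturbs reference logits with Gaussian noise before computing a KL divergence. All function names, the weighting form, and the choice of which logits to perturb are assumptions for illustration.

```python
import math
import random


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_margin(pol_pos, pol_neg, ref_pos, ref_neg, beta=0.1):
    """Standard DPO implicit-reward margin from sequence log-probabilities."""
    return beta * ((pol_pos - ref_pos) - (pol_neg - ref_neg))


def hardness_weight(margin, gamma=1.0):
    """ASSUMED weight (not from the paper): sigmoid(-margin) ** gamma.

    Close to 1 when the policy ranks the pair wrongly (hard sample),
    close to 0 when the pair is already well separated (easy sample).
    """
    return math.exp(gamma * log_sigmoid(-margin))


def weighted_dpo_loss(batch, beta=0.1, gamma=1.0):
    """Weight-normalized DPO loss over (pol_pos, pol_neg, ref_pos, ref_neg) tuples.

    In a real training loop the weight would be treated as a constant
    (no gradient flows through it); here everything is plain floats.
    """
    total, norm = 0.0, 0.0
    for pol_pos, pol_neg, ref_pos, ref_neg in batch:
        m = dpo_margin(pol_pos, pol_neg, ref_pos, ref_neg, beta)
        w = hardness_weight(m, gamma)
        total += w * -log_sigmoid(m)
        norm += w
    return total / norm


def softmax(logits):
    """Softmax with max-subtraction for numerical stability."""
    mx = max(logits)
    exps = [math.exp(z - mx) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def noisy_kl(policy_logits, ref_logits, sigma=0.1, rng=None):
    """ASSUMED regularizer: KL between the policy distribution and a
    Gaussian-perturbed copy of the reference distribution."""
    rng = rng or random.Random(0)
    noisy_ref = [z + rng.gauss(0.0, sigma) for z in ref_logits]
    return kl_divergence(softmax(policy_logits), softmax(noisy_ref))
```

With `gamma=0.0` every weight equals 1 and `weighted_dpo_loss` reduces to the plain mean DPO loss, which makes the effect of the reweighting easy to isolate; with `sigma=0.0` and identical logits, `noisy_kl` is zero.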

Paper Structure

This paper contains 30 sections, 12 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: (a) Existing paradigms and performance comparisons of large model-based sequential recommendation; (b) Data construction and sample hardness issues in MLLM-SR during DPO training; (c) Title-image similarity scores exhibit a "the weak get weaker" trend due to MLLM modality bias after DPO training.
  • Figure 2: Illustration of HaNoRec. Hardness-aware Reweighting Strategy (HaRS) scales DPO weight proportionally to sample hardness, spotlighting challenging cases. Noise-regularized Distribution Optimization (NoDO) adds Gaussian noise and adaptive KL to align title–image semantics and curb modality bias.
  • Figure 3: Ablation studies of model variants on all datasets w.r.t. HR@3 and NDCG@3.
  • Figure 4: Case study on MovieLens for user 4126.
  • Figure 5: Performance comparison w.r.t. different hyperparameters. The left $y$-axes show AUC and HR; the right $y$-axes show NDCG.
  • ...and 1 more figure