MM-RLHF: The Next Step Forward in Multimodal LLM Alignment

Yi-Fan Zhang, Tao Yu, Haochen Tian, Chaoyou Fu, Peiyan Li, Jianshu Zeng, Wulin Xie, Yang Shi, Huanyu Zhang, Junkang Wu, Xue Wang, Yibo Hu, Bin Wen, Fan Yang, Zhang Zhang, Tingting Gao, Di Zhang, Liang Wang, Rong Jin, Tieniu Tan

TL;DR

The paper asks whether aligning multimodal LLMs with human preferences can systematically improve their capabilities, rather than only narrow properties such as hallucination rate. It introduces MM-RLHF, a dataset of 120k fine-grained, human-annotated preference comparison pairs, together with a Critique-Based Reward Model that writes a critique before assigning a score and Dynamic Reward Scaling, which weights each sample's loss according to its reward signal. Evaluated across 10 dimensions and 27 benchmarks, fine-tuning LLaVA-ov-7B with this pipeline yields a 19.5% increase in conversational abilities and a 60% improvement in safety. The preference dataset, reward model, training and evaluation code, and the reward modeling and safety benchmarks are open-sourced.

Abstract

Despite notable advancements in Multimodal Large Language Models (MLLMs), most state-of-the-art models have not undergone thorough alignment with human preferences. This gap exists because current alignment research has primarily achieved progress in specific areas (e.g., hallucination reduction), while the broader question of whether aligning models with human preferences can systematically enhance MLLM capability remains largely unexplored. To this end, we introduce MM-RLHF, a dataset containing $\mathbf{120k}$ fine-grained, human-annotated preference comparison pairs. This dataset represents a substantial advancement over existing resources, offering superior size, diversity, annotation granularity, and quality. Leveraging this dataset, we propose several key innovations to improve both the quality of reward models and the efficiency of alignment algorithms. Notably, we introduce a Critique-Based Reward Model, which generates critiques of model outputs before assigning scores, offering enhanced interpretability and more informative feedback compared to traditional scalar reward mechanisms. Additionally, we propose Dynamic Reward Scaling, a method that adjusts the loss weight of each sample according to the reward signal, thereby optimizing the use of high-quality comparison pairs. Our approach is rigorously evaluated across $\mathbf{10}$ distinct dimensions and $\mathbf{27}$ benchmarks, with results demonstrating significant and consistent improvements in model performance. Specifically, fine-tuning LLaVA-ov-7B with MM-RLHF and our alignment algorithm leads to a $\mathbf{19.5}$% increase in conversational abilities and a $\mathbf{60}$% improvement in safety. We have open-sourced the preference dataset, reward model, training and evaluation code, as well as reward modeling and safety benchmarks. For more details, please visit our project page: https://mm-rlhf.github.io.
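As a rough illustration of Dynamic Reward Scaling, the sketch below evaluates the saturating curve $1 - e^{-k \delta}$ that the Figure 5 caption names. It assumes $\delta$ is a non-negative reward margin between the preferred and rejected response, which this page does not state explicitly, and it does not show how the factor enters the training loss, since that is not given here either. The function name `scaling_factor` and the sample values of $k$ and $\delta$ are hypothetical.

```python
import math


def scaling_factor(delta: float, k: float) -> float:
    """Evaluate 1 - exp(-k * delta), the curve named in the Figure 5 caption.

    Assumption: delta is a non-negative reward margin and k > 0 controls
    how quickly the factor saturates toward 1.
    """
    return 1.0 - math.exp(-k * delta)


if __name__ == "__main__":
    margins = [0.0, 0.5, 1.0, 2.0, 4.0]  # hypothetical reward margins
    for k in (0.1, 0.5, 1.0, 2.0):       # hypothetical values of k
        row = ", ".join(f"{scaling_factor(d, k):.3f}" for d in margins)
        print(f"k={k}: {row}")
```

Under these assumptions the factor is 0 at a zero margin, grows with the margin, and stays below 1, so a larger $k$ makes the factor saturate at smaller margins.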

Paper Structure (38 sections, 5 equations, 12 figures, 7 tables)

Figures (12)

  • Figure 1: MM-RLHF Construction Pipeline. (1) Data Collection and Cleaning: Starting with 10 million instruction samples, we cluster data based on image similarity, and uniformly sample across diverse categories. This results in a diverse dataset covering image-based Q&A (e.g., multiple-choice, dialogues, and safety-related questions) and video Q&A formats. (2) Response Generation: We leverage state-of-the-art models, including GPT-4o and Qwen2-VL-72B, to generate high-quality responses. (3) Human Annotation: We conduct manual annotation across nine categories, including scoring, ranking, and explanations, ensuring fine-grained evaluation.
  • Figure 2: Re-Sample results from the clustering process. Due to the large total number of samples, the clustered and deduplicated results contain a rich diversity of categories. Selected samples include topics such as mathematics, daily life, natural scenes, medicine, electronic technology, and OCR scenarios, showcasing a variety of problem-image pairs. The 2D features were obtained via UMAP dimensionality reduction.
  • Figure 3: Illustration of the multi-task reward model training process. The process begins with a user query and corresponding model responses, which are ranked and annotated by humans. Human annotations are expanded using GPT-4o to provide enhanced rationales. The reward model is trained with two objectives: (1) Learning to Provide Critique, where the model learns to provide detailed critiques and evaluations for model responses, and (2) Learning Scoring, where the model learns to assign scores based on the model response and critique. The integration of these tasks ensures a robust evaluation framework for improving model outputs.
  • Figure 4: Overview of the MM-DPO framework. The dynamic reward scaling mechanism adjusts the update strength based on the reward margin, improving optimization stability and robustness.
  • Figure 5: Effect of $k$ on $1 - e^{-k \delta}$.
  • ...and 7 more figures
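
Step (1) of the Figure 1 caption clusters the data by image similarity and then samples uniformly across categories. A minimal sketch of the sampling half of that step follows, assuming cluster labels have already been computed elsewhere. The function name, the `per_cluster` parameter, and the toy labels are hypothetical, and the paper's actual procedure may differ.

```python
import random
from collections import defaultdict


def sample_uniform_across_clusters(cluster_ids, per_cluster, seed=0):
    """Pick up to `per_cluster` item indices from every cluster.

    Assumption: cluster_ids[i] is the precomputed cluster label of item i.
    Clusters smaller than `per_cluster` contribute all of their items.
    """
    rng = random.Random(seed)

    # Group item indices by their cluster label.
    by_cluster = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        by_cluster[cluster].append(index)

    # Draw the same quota from each cluster, without replacement.
    selected = []
    for cluster in sorted(by_cluster):
        members = by_cluster[cluster]
        take = min(per_cluster, len(members))
        selected.extend(rng.sample(members, take))
    return selected


if __name__ == "__main__":
    # Toy labels: one large cluster and two small ones.
    labels = ["math"] * 6 + ["ocr"] * 2 + ["medicine"] * 1
    picked = sample_uniform_across_clusters(labels, per_cluster=2)
    print([labels[i] for i in picked])
```

Drawing a fixed quota per cluster keeps a large cluster from dominating the sample, which is one plausible reading of "uniformly sample across diverse categories".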