Generative RLHF-V: Learning Principles from Multi-modal Human Preference

Jiayi Zhou, Jiaming Ji, Boyuan Chen, Jiapeng Sun, Wenqi Chen, Donghai Hong, Sirui Han, Yike Guo, Yaodong Yang

TL;DR

Generative RLHF-V addresses the mismatch between traditional score-based rewards and human preferences in multi-modal LLMs by learning principled rewards through a generative reward model (GRM) trained with reinforcement learning. It then refines the policy using grouped comparisons, which convert pairwise GRM judgments into more precise group-based rewards and yield near-linear gains as the candidate set grows. Across seven benchmarks and four MLLMs, the framework improves performance by 18.1%, versus 5.3% for baseline RLHF, and shows out-of-distribution generalization on discrimination tasks. The approach combines interpretability with scalability, though it also reveals reward-hacking risks under overfitting, highlighting the need for robust safeguards and benchmarking in future alignment work.

Abstract

Training multi-modal large language models (MLLMs) that align with human intentions is a long-term challenge. Traditional score-only reward models for alignment suffer from low accuracy, weak generalization, and poor interpretability, blocking the progress of alignment methods such as reinforcement learning from human feedback (RLHF). Generative reward models (GRMs) leverage MLLMs' intrinsic reasoning capabilities to discriminate between pair-wise responses, but their pair-wise paradigm makes it hard to generalize to learnable rewards. We introduce Generative RLHF-V, a novel alignment framework that integrates GRMs with multi-modal RLHF. We propose a two-stage pipeline: (1) multi-modal generative reward modeling from RL, where RL guides GRMs to actively capture human intention and then predict the correct pair-wise scores; and (2) RL optimization from grouped comparison, which enhances multi-modal RL scoring precision by comparing responses within a group. Experimental results demonstrate that, besides out-of-distribution generalization of RM discrimination, our framework improves 4 MLLMs' performance across 7 benchmarks by 18.1%, whereas baseline RLHF improves it by only 5.3%. We further validate that Generative RLHF-V achieves a near-linear improvement with an increasing number of candidate responses. Our code and models can be found at https://generative-rlhf-v.github.io.
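
To make the first stage concrete, the following is a minimal sketch of a rule-based reward for GRM training, assuming the GRM's output has already been parsed into two numeric scores. The function name `grm_preference_reward`, the 0/1 reward values, and the treatment of ties are illustrative assumptions, not details taken from the paper.

```python
from typing import Tuple


def grm_preference_reward(scores: Tuple[float, float], preferred: int) -> float:
    """Reward one GRM rollout on a human-labeled preference pair.

    scores:    the two scores the GRM assigned to responses 0 and 1.
    preferred: index (0 or 1) of the response the human annotator preferred.

    Assumption: the reward is 1.0 when the preferred response receives the
    strictly higher score and 0.0 otherwise (ties earn nothing).
    """
    if preferred not in (0, 1):
        raise ValueError("preferred must be 0 or 1")
    chosen = scores[preferred]
    rejected = scores[1 - preferred]
    return 1.0 if chosen > rejected else 0.0


if __name__ == "__main__":
    # The GRM scored response 0 higher, and humans also preferred response 0.
    print(grm_preference_reward((8.0, 3.0), preferred=0))
    # The GRM scored response 0 higher, but humans preferred response 1.
    print(grm_preference_reward((8.0, 3.0), preferred=1))
```

Because the reward depends only on whether the score ordering matches the human label, the GRM is free to produce whatever reasoning leads it to that ordering, which is how the abstract describes RL guiding the model toward the underlying human intention.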

Paper Structure

This paper contains 17 sections, 6 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: Advanced multi-modal large language models (MLLMs) call for principled preference learning. In MLLM alignment, traditional RLHF methods learn only scalar scores from preferences. In contrast, our Generative RLHF-V learns principles from preferences and optimizes based on this comprehensive comparison. Experimental results show that Generative RLHF-V elevates 2B and 3B MLLMs to 7B performance across 7 benchmarks. It also lifts pretrained models to instruct-model capabilities and enables open-source models to match closed-source experts.
  • Figure 1: Performance of GRMs in the MLLM-as-a-Judge Score task, measured by the Pearson correlation coefficient.
  • Figure 2: Comparison of our pipelines to traditional ones. For reward modeling, we have the generative RM actively reason about the advantages and disadvantages of the two answers and output corresponding scores. If the better response gets a higher score, the RM receives a positive reward. For RL optimization, we compare responses in pairs within a group to obtain more accurate scores.
  • Figure 3: An example of generative reward modeling from RL. The goal of RL is to make MLLMs assign higher scores to responses that align with human preferences. Through RL optimization, MLLMs can infer the underlying principle behind how humans annotate these binary preferences.
  • Figure 4: An example of RL from grouped comparison. Its advantage lies in utilizing grouped comparisons to achieve more accurate scoring. Response B provides accurate and comprehensive information, thus receiving the highest score; although response A is somewhat arbitrary, it performs accurate image recognition and obtains a higher score than C and D.
  • ...and 9 more figures
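
The grouped comparison described in the captions of Figures 2 and 4 can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_judge` is a hypothetical stand-in for the GRM (it simply scores by response length), and averaging the scores each response receives over its n-1 comparisons is one plausible aggregation rule, not necessarily the one the paper uses.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

# A pairwise judge takes (prompt, response_a, response_b) and returns
# one score for each of the two responses.
PairJudge = Callable[[str, str, str], Tuple[float, float]]


def grouped_comparison_rewards(
    prompt: str, responses: Sequence[str], judge: PairJudge
) -> List[float]:
    """Turn pairwise judgments into one reward per response in a group.

    Every unordered pair in the group is judged once. Each response's reward
    is the mean of the scores it received across its n-1 comparisons
    (assumed aggregation rule).
    """
    n = len(responses)
    if n < 2:
        raise ValueError("grouped comparison needs at least two responses")
    totals = [0.0] * n
    for i, j in combinations(range(n), 2):
        score_i, score_j = judge(prompt, responses[i], responses[j])
        totals[i] += score_i
        totals[j] += score_j
    return [total / (n - 1) for total in totals]


def toy_judge(prompt: str, a: str, b: str) -> Tuple[float, float]:
    """Hypothetical judge for demonstration only: longer answers score higher."""
    return float(len(a)), float(len(b))


if __name__ == "__main__":
    group = ["a", "bbb", "cc"]
    print(grouped_comparison_rewards("Describe the image.", group, toy_judge))
```

With n candidates this sketch issues n(n-1)/2 judge calls, so the cost of scoring grows quadratically with group size; the resulting per-response rewards can then feed a group-based RL objective in place of a scalar reward model's output.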