RoVRM: A Robust Visual Reward Model Optimized via Auxiliary Textual Preference Data

Chenglong Wang; Yang Gan; Yifu Huo; Yongyu Mu; Murun Yang; Qiaozhi He; Tong Xiao; Chunliang Zhang; Tongran Liu; Quan Du; Di Yang; Jingbo Zhu

TL;DR

RoVRM tackles the scarcity of visual preference data for LVLM alignment by transferring rich textual preferences to the visual domain through a three-phase progressive training regimen and optimal transport–based data selection. The approach builds a robust visual reward signal that improves alignment under best-of-n sampling and reinforcement learning, with additional benefits when integrated with direct preference optimization. Key findings include substantial improvements over vanilla VRMs across multiple vision-language benchmarks, reduced hallucination, and demonstrated few-shot transfer capabilities for VRMs. The work offers a practical pathway to stronger, data-efficient human-preference alignment in LVLMs, with broad implications for safer and more reliable visual reasoning in multimodal systems.

Abstract

Large vision-language models (LVLMs) often fail to align with human preferences, leading to issues such as generating misleading content without proper visual context (also known as hallucination). A promising solution to this problem is to use human-preference alignment techniques, such as best-of-n sampling and reinforcement learning. However, these techniques are hindered by the scarcity of visual preference data, which is required to train a visual reward model (VRM). In this work, we continue this line of research and present a Robust Visual Reward Model (RoVRM) that improves human-preference alignment for LVLMs. RoVRM leverages auxiliary textual preference data through three-phase progressive training and optimal transport-based preference data selection to effectively mitigate the scarcity of visual preference data. We evaluate RoVRM on commonly used vision-language tasks with the LLaVA-1.5-7B and -13B models. Experimental results demonstrate that RoVRM consistently outperforms traditional VRMs. Furthermore, our three-phase progressive training and preference data selection approaches yield consistent performance gains when applied to ranking-based alignment techniques, such as direct preference optimization.
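To make the best-of-n sampling step mentioned in the abstract concrete, the following is a minimal Python sketch rather than the authors' implementation. It assumes a reward function that maps a prompt and a candidate response to a scalar score; `best_of_n`, `toy_reward`, and the word lists are hypothetical names introduced only for illustration, and the toy scorer merely stands in for a trained visual reward model.

```python
from typing import Callable, Sequence


def best_of_n(
    prompt: str,
    candidates: Sequence[str],
    reward_fn: Callable[[str, str], float],
) -> str:
    """Return the candidate response that the reward function scores highest."""
    if not candidates:
        raise ValueError("best-of-n sampling needs at least one candidate")
    scores = [reward_fn(prompt, candidate) for candidate in candidates]
    # Pick the index of the highest score; ties resolve to the earliest candidate.
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index]


def toy_reward(prompt: str, response: str) -> float:
    """Hypothetical stand-in for a reward model (assumption, not from the paper).

    It rewards words assumed to be visible in the image and penalizes
    words assumed to be hallucinated.
    """
    visible = {"dog", "ball"}   # assumed image content
    hallucinated = {"cat"}      # assumed absent from the image
    words = response.lower().replace(".", "").split()
    return float(sum(w in visible for w in words) - sum(w in hallucinated for w in words))


# In practice the n candidates would be sampled from the LVLM; here they are fixed.
candidates = ["A dog chases a ball.", "A cat sleeps.", "A dog sits."]
best = best_of_n("Describe the image.", candidates, toy_reward)
```

In this setting the quality of the selected response depends entirely on the reward function, which is why a more robust reward model matters for best-of-n sampling.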

Paper Structure (48 sections, 8 equations, 8 figures, 9 tables)

This paper contains 48 sections, 8 equations, 8 figures, 9 tables.

Introduction
Related Work
Large Vision-Language Models
Human-Preference Alignment for LVLMs
Our Method
Preliminaries
Reinforcement Learning with Human Feedback
Direct Preference Optimization
Best-of-$n$ Sampling
A Robust Visual Reward Model
Three-Phase Progressive Training
Preference Data Selection
Experiments
Experimental Setups
Datasets
...and 33 more sections

Figures (8)

Figure 1: We propose three-phase progressive training and optimal transport-based preference data selection approaches to train RoVRM. In three-phase progressive training, we take full advantage of textual preference data to compensate for the limited availability of visual preference data. With preference data selection, the samples for phases one and two are each chosen based on the samples selected for the following phase. ✓ denotes a selected sample, while ✗ denotes one that is not selected.
Figure 2: We train RoVRM with varying amounts of textual and image caption preference data. Experiments are conducted on the LLaVA-1.5-7B model using three different seeds, and we report the average results along with their standard deviation.
Figure 3: Performance during RL training is evaluated on the MMHalBench (left) and LLaVA-Bench (right) benchmarks using three different seeds.
Figure 4: Performance of best-of-$n$ sampling (BoS) and RL on MMHalBench (left) and LLaVA-Bench (right) across three different seeds. The RoVRM model is trained with varying amounts of visual preference data (VPD): 0k, 1k, 5k, 10k, 20k, 30k, and 40k.
Figure 5: Performance of best-of-$n$ sampling (BoS) with different sampling sizes: 4, 8, 16, and 32.
...and 3 more figures
