VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models

Jiacheng Ruan, Wenzhen Yuan, Xian Gao, Ye Guo, Daoxin Zhang, Zhe Xu, Yao Hu, Ting Liu, Yuzhuo Fu

TL;DR

VLRMBench introduces a comprehensive benchmark for vision-language reward models, addressing process understanding, outcome judgment, and critique generation across 12 tasks and 12,634 questions. It employs a three-stage data pipeline to curate high-quality reasoning traces from math, hallucination, and multi-image domains, enabling fine-grained evaluation of stepwise reasoning and final outcomes. Key findings show that even large open- and closed-source systems struggle with long reasoning, spatial and cross-image errors, and instruction-following robustness, underscoring the need for specialized VLRM training and inference strategies such as test-time scaling and feedback loops. The work provides a standardized platform and insights to drive future development of VLRMs and their integration with LVLMs for improved reasoning and error correction.

Abstract

Although large visual-language models (LVLMs) have demonstrated strong performance in multimodal tasks, errors may occasionally arise due to biases during the reasoning process. Recently, reward models (RMs) have become increasingly pivotal in the reasoning process. Specifically, process RMs evaluate each reasoning step, outcome RMs focus on the assessment of reasoning results, and critique RMs perform error analysis on the entire reasoning process, followed by corrections. However, existing benchmarks for vision-language RMs (VLRMs) typically assess only a single aspect of their capabilities (e.g., distinguishing between two answers), thus limiting all-round evaluation and restricting the development of RMs in the visual-language domain. To address this gap, we propose a comprehensive and challenging benchmark, dubbed VLRMBench, encompassing 12,634 questions. VLRMBench is constructed from three distinct types of datasets, covering mathematical reasoning, hallucination understanding, and multi-image understanding. We design 12 tasks across three major categories, focusing on evaluating VLRMs in process understanding, outcome judgment, and critique generation. Extensive experiments are conducted on 21 open-source models and 5 advanced closed-source models, highlighting the challenges posed by VLRMBench. For instance, in 'Forecasting Future', a binary classification task, the advanced GPT-4o achieves only 76.0% accuracy. Additionally, we perform comprehensive analytical studies, offering valuable insights for the future development of VLRMs. We anticipate that VLRMBench will serve as a pivotal benchmark in advancing VLRMs. Code and datasets will be available at https://github.com/JCruan519/VLRMBench.

Paper Structure

This paper contains 31 sections, 1 equation, 4 figures, 7 tables.

Figures (4)

  • Figure 1: The collaborative data filtering and generation pipeline of VLRMBench. During the filtering stage, Qwen2VL-7B is tasked with answering questions under two conditions: without and with image input. Questions for which the model provides incorrect answers are retained. Subsequently, QVQ-72B-preview and GPT-4o are employed to generate reasoning processes and construct tasks for three themes.
  • Figure 2: The distribution of the number of reasoning steps for different data sources.
  • Figure 3: Specific input examples for VLRMBench. Left: An example of the SC task. VLRMs are required to understand each reasoning step and output a sequence, where '1' indicates the presence of erroneous information in the current reasoning step, and '0' indicates its absence. Right: An example of the FF task. VLRMs are tasked with assessing the correctness of the final result based on the reasoning process of the preceding steps.
  • Figure 4: Performance changes with the test-time scaling.
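
The filtering rule in the Figure 1 caption and the SC output format in the Figure 3 caption can be illustrated with a minimal Python sketch. It is not the authors' code. The function names are hypothetical, the sketch assumes a question is retained only when the filter model is wrong both without and with the image (the caption does not say whether "both" or "either" is meant), and per-step accuracy is used purely for illustration and may differ from the metric reported in the paper.

```python
import re
from typing import List


def keep_question(correct_without_image: bool, correct_with_image: bool) -> bool:
    """Hypothetical filter: retain a question only if the filter model
    answered incorrectly in both conditions (an assumption, see lead-in)."""
    return (not correct_without_image) and (not correct_with_image)


def parse_step_labels(output: str, num_steps: int) -> List[int]:
    """Extract the per-step 0/1 sequence from an SC-style model output.

    Assumes the output contains only the label sequence, e.g. "[0, 1, 0, 0]",
    where 1 marks a step with erroneous information and 0 a clean step.
    """
    labels = [int(ch) for ch in re.findall(r"[01]", output)]
    if len(labels) != num_steps:
        raise ValueError(
            f"expected {num_steps} step labels, found {len(labels)}"
        )
    return labels


def step_accuracy(pred: List[int], gold: List[int]) -> float:
    """Fraction of reasoning steps whose predicted label matches the gold label."""
    if len(pred) != len(gold) or not gold:
        raise ValueError("pred and gold must be non-empty and of equal length")
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


if __name__ == "__main__":
    pred = parse_step_labels("[0, 1, 0, 0]", num_steps=4)
    print(keep_question(False, False), pred, step_accuracy(pred, [0, 1, 1, 0]))
```

Raising an error on a length mismatch, rather than padding or truncating, surfaces malformed outputs explicitly, which is relevant given that the TL;DR names instruction-following robustness as a weak point.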