What, Whether and How? Unveiling Process Reward Models for Thinking with Images Reasoning

Yujin Zhou, Pengcheng Wen, Jiale Chen, Boqin Yin, Han Zhu, Jiaming Ji, Juntao Dai, Chi-Min Chan, Sirui Han

TL;DR

This paper tackles the evaluation of process reward models (PRMs) for LVLMs that follow the thinking with images paradigm. It first identifies seven fine-grained error types by analyzing 7,558 reasoning trajectories and demonstrates the need for specialized PRMs, then introduces ThinkWithImages-PRMBench, a multimodal PRM benchmark of 1,206 trajectories spanning 4 categories and 16 subcategories. Guided-search experiments show that using current LVLMs as PRMs yields modest gains in reasoning performance, yet these models still fall significantly short of human performance, with large disparities across error types and notable stepwise instability. The benchmark provides a foundation for advancing PRMs in LVLMs and highlights concrete directions for improving process supervision in visual reasoning.

Abstract

The rapid advancement of Large Vision Language Models (LVLMs) has demonstrated excellent abilities in various visual tasks. Building upon these developments, the thinking with images paradigm has emerged, enabling models to dynamically edit and re-encode visual information at each reasoning step, mirroring human visual processing. However, this paradigm introduces significant challenges as diverse errors may occur during reasoning processes. This necessitates Process Reward Models (PRMs) for distinguishing positive and negative reasoning steps, yet existing benchmarks for PRMs are predominantly text-centric and lack comprehensive assessment under this paradigm. To address these gaps, this work introduces the first comprehensive benchmark specifically designed for evaluating PRMs under the thinking with images paradigm. Our main contributions are: (1) Through extensive analysis of reasoning trajectories and guided search experiments with PRMs, we define 7 fine-grained error types and demonstrate both the necessity for specialized PRMs and the potential for improvement. (2) We construct a comprehensive benchmark comprising 1,206 manually annotated thinking with images reasoning trajectories spanning 4 categories and 16 subcategories for fine-grained evaluation of PRMs. (3) Our experimental analysis reveals that current LVLMs fall short as effective PRMs, exhibiting limited capabilities in visual reasoning process evaluation with significant performance disparities across error types, positive evaluation bias, and sensitivity to reasoning step positions. These findings demonstrate the effectiveness of our benchmark and establish crucial foundations for advancing PRMs in LVLMs.
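
The guided search mentioned in contribution (1) can be illustrated with a minimal sketch. This is not the authors' procedure: it assumes a simple greedy scheme in which a policy proposes candidate next steps and a PRM scores each one, and every name below (`prm_guided_search`, `toy_propose`, `toy_score`, the `ANSWER` prefix) is hypothetical. Steps are plain strings here; in the thinking with images setting a step would also carry an edited image.

```python
from typing import Callable, List, Sequence

# Hypothetical types for illustration only.
Step = str
Proposer = Callable[[Sequence[Step]], List[Step]]  # candidate next steps given a prefix
Scorer = Callable[[Sequence[Step], Step], float]   # PRM score of a candidate given a prefix


def prm_guided_search(propose: Proposer, score: Scorer, max_steps: int) -> List[Step]:
    """Greedy step-level search: keep the highest-scoring candidate at each step."""
    trajectory: List[Step] = []
    for _ in range(max_steps):
        candidates = propose(trajectory)
        if not candidates:
            break
        # The PRM ranks candidates; ties resolve to the first candidate.
        best = max(candidates, key=lambda c: score(trajectory, c))
        trajectory.append(best)
        if best.startswith("ANSWER"):  # assumed convention for a final step
            break
    return trajectory


# Toy stand-ins for a policy model and a PRM (assumptions, not real models).
def toy_propose(prefix: Sequence[Step]) -> List[Step]:
    if len(prefix) == 0:
        return ["crop the wrong region", "crop the chart legend"]
    if len(prefix) == 1:
        return ["zoom in again", "ANSWER: 42"]
    return ["ANSWER: 42"]


def toy_score(prefix: Sequence[Step], step: Step) -> float:
    if "wrong" in step:
        return 0.1  # an erroneous visual operation gets a low reward
    if step.startswith("ANSWER"):
        return 0.9
    return 0.5


result = prm_guided_search(toy_propose, toy_score, max_steps=4)
print(result)
```

In this toy run the scorer steers the search away from the erroneous crop. A PRM with a positive evaluation bias, as the abstract reports for current LVLMs, would score erroneous steps too generously and lose that steering effect.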

Paper Structure (26 sections, 6 figures, 3 tables, 1 algorithm)

Figures (6)

  • Figure 1: Identified Error Types in Thinking with Images Paradigm. We identify and categorize seven distinct error types from reasoning trajectories of current thinking with images models. Some are inherent limitations of LVLMs and some are novel errors introduced by this paradigm.
  • Figure 2: Error Types Detection Procedure for Thinking with Images Paradigm. We collect extensive reasoning trajectories by deploying four thinking with images models across four benchmarks. Through systematic analysis and categorization of these trajectories, we identify seven distinct error types that commonly occur in the thinking with images paradigm.
  • Figure 3: Performance Comparison of Thinking with Images Models with and without LVLMs as PRMs. Results demonstrate a general trend toward improved performance when using LVLMs as PRMs, though improvements vary across models.
  • Figure 4: Construction Pipeline for ThinkWithImages-PRMBench. Building upon the reasoning trajectories collected as described in Figure 2, we perform step-by-step manual annotation for each trajectory and classify erroneous steps into seven error types. A comprehensive quality control process filters trajectories based on three key criteria: accuracy, consistency, and completeness, resulting in the curated ThinkWithImages-PRMBench.
  • Figure 5: Composition of ThinkWithImages-PRMBench.
  • ...and 1 more figure