MPBench: A Comprehensive Multimodal Reasoning Benchmark for Process Errors Identification

Zhaopan Xu, Pengfei Zhou, Jiaxin Ai, Wangbo Zhao, Kai Wang, Xiaojiang Peng, Wenqi Shao, Hongxun Yao, Kaipeng Zhang

TL;DR

MPBench addresses the gap in evaluating multimodal process reward models by presenting a comprehensive benchmark with three evaluation paradigms—Step Correctness, Answer Aggregation, and Reasoning Process Search—and a large, multimodal dataset derived from M^3CoT. It enables fine-grained assessment of PRMs during training and inference and analyzes 12 LLMs, including GPT-4o and Gemini variants, to reveal scale effects, cross-paradigm correlations, and domain-specific challenges. The study finds that model scale significantly impacts complex tasks like step correctness and search, while correlations among abilities are positive but nuanced, with mathematics proving the most demanding domain. These findings guide future development of multimodal PRMs and suggest domain-aware training or prompting strategies to enhance reasoning accuracy in real-world tasks.

Abstract

Reasoning is an essential capacity for large language models (LLMs) to address complex tasks, where the identification of process errors is vital for improving this ability. Recently, process-level reward models (PRMs) were proposed to provide step-wise rewards that facilitate reinforcement learning and data production during training and guide LLMs toward correct steps during inference, thereby improving reasoning accuracy. However, existing benchmarks of PRMs are text-based and focus on error detection, neglecting other scenarios like reasoning search. To address this gap, we introduce MPBench, a comprehensive, multi-task, multimodal benchmark designed to systematically assess the effectiveness of PRMs in diverse scenarios. MPBench employs three evaluation paradigms, each targeting a specific role of PRMs in the reasoning process: (1) Step Correctness, which assesses the correctness of each intermediate reasoning step; (2) Answer Aggregation, which aggregates multiple solutions and selects the best one; and (3) Reasoning Process Search, which guides the search for optimal reasoning steps during inference. Through these paradigms, MPBench enables comprehensive evaluation and provides insights into the development of multimodal PRMs.
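To make the Answer Aggregation paradigm concrete, the following is a minimal sketch of one way a PRM's step-wise scores could be used to pick a final answer from several candidate solutions. It is not taken from the paper: the function names (`solution_score`, `aggregate_answers`), the use of the minimum step score as the solution-level score, and the score-weighted vote are all illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def solution_score(step_scores: List[float]) -> float:
    """Collapse per-step PRM scores into one solution-level score.

    Assumption: the weakest step determines the quality of the whole
    solution, so we take the minimum. Other reductions (product, mean,
    last step) are equally plausible choices.
    """
    if not step_scores:
        raise ValueError("a solution needs at least one scored step")
    return min(step_scores)


def aggregate_answers(candidates: List[Tuple[str, List[float]]]) -> str:
    """Return the final answer with the largest total solution-level score.

    Each candidate is (final_answer, per_step_scores). Candidates that
    reach the same final answer pool their scores (a weighted vote).
    """
    if not candidates:
        raise ValueError("no candidate solutions given")
    totals: Dict[str, float] = defaultdict(float)
    for answer, step_scores in candidates:
        totals[answer] += solution_score(step_scores)
    # On a tie, the answer that appeared first among the candidates wins.
    return max(totals, key=totals.get)


# Hypothetical PRM scores for four sampled solutions to one question.
candidates = [
    ("A", [0.9, 0.8, 0.7]),
    ("B", [0.95, 0.2, 0.9]),  # one weak step drags this solution down
    ("A", [0.6, 0.9, 0.5]),
    ("C", [0.85, 0.9, 0.8]),
]
print(aggregate_answers(candidates))
```

In this toy example, answer "C" has the single strongest solution, but the two moderately scored solutions that agree on "A" outweigh it once their scores are pooled.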

Paper Structure

This paper contains 43 sections, 2 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: An overview of MPBench. Left: data curation procedure. Right: the three evaluation paradigms (Step Correctness, Answer Aggregation, and Reasoning Process Search), which assess PRM performance on tasks such as identifying errors, aggregating answers, and guiding reasoning steps.
  • Figure 2: Performance breakdown on MPBench.
  • Figure 3: Interrelationship between a model’s capabilities in step correctness identification, answer aggregation, and reasoning process search. Each point on the graph represents a model, with coordinates indicating its performance in step correctness identification (SC), answer aggregation (AA), and reasoning process search (RS). Blue lines show the fitted lines for the SC/AA, SC/RS, and AA/RS scatter plots, while a red dashed line represents the ideal growth line. The slope of this ideal growth line is the ratio of the random values of each metric.
  • Figure 4: Impact of Error Position on Model Performance. (a) Distribution of error positions within the dataset. (b) Model performance on reasoning process search, measured by average F1 score and MCC, across different error positions. (c) Model performance on Step Correctness, measured by F1 score, across different error positions. Note: Step 1 and steps beyond 10 are truncated for improved visualization.
  • Figure 5: The impact of ICL few-shot numbers on model performance.
  • ...and 3 more figures
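
The Figure 4 caption reports F1 score and MCC (Matthews correlation coefficient). As a reference for how these two metrics are computed from binary step labels, here is a minimal sketch using the standard definitions. Treating an erroneous step as the positive class (label 1) and the example labels themselves are assumptions made for illustration, not details confirmed by the paper.

```python
import math
from typing import List, Tuple


def confusion(y_true: List[int], y_pred: List[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for binary labels, with 1 as the positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    tp, fp, fn, _ = confusion(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(y_true: List[int], y_pred: List[int]) -> float:
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

    Returns 0.0 when any marginal count is zero (a common convention).
    """
    tp, fp, fn, tn = confusion(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical labels for six reasoning steps: 1 = erroneous, 0 = correct.
y_true = [0, 0, 1, 0, 1, 1]
y_pred = [0, 1, 1, 0, 0, 1]
print(round(f1_score(y_true, y_pred), 4))  # 0.6667
print(round(mcc(y_true, y_pred), 4))       # 0.3333
```

Unlike F1, MCC also accounts for true negatives, so it stays informative when erroneous steps are much rarer than correct ones.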