ProJudge: A Multi-Modal Multi-Discipline Benchmark and Instruction-Tuning Dataset for MLLM-based Process Judges

Jiaxin Ai, Pengfei Zhou, Zhaopan Xu, Ming Li, Fanrui Zhang, Zizhen Li, Jianwen Sun, Yukang Feng, Baojin Huang, Zhongyuan Wang, Kaipeng Zhang

TL;DR

ProJudgeBench delivers a first-of-its-kind, multi-modal, multi-discipline benchmark with 2,400 test cases and 50,118 step-level labels to evaluate MLLM-based process judges on per-step correctness, error-type classification, and explanations. To close the reliability gap between open-source and proprietary models, the authors introduce ProJudge-173k, a large instruction-tuning dataset, and Dynamic Dual-Phase fine-tuning that separates reasoning from evaluation to improve robustness. Across extensive experiments, larger models excel as judges, fine-tuning substantially boosts open-source performance, and the Dynamic Dual-Phase approach narrows the gap while improving generalization to unseen problems. The work provides practical benchmarks and training strategies to advance trustworthy, multi-modal process evaluation in scientific reasoning.

Abstract

As multi-modal large language models (MLLMs) frequently exhibit errors when solving scientific problems, evaluating the validity of their reasoning processes is critical for ensuring reliability and uncovering fine-grained model weaknesses. Since human evaluation is laborious and costly, prompting MLLMs as automated process judges has become a common practice. However, the reliability of these model-based judges remains uncertain. To address this, we introduce ProJudgeBench, the first comprehensive benchmark specifically designed for evaluating the abilities of MLLM-based process judges. ProJudgeBench comprises 2,400 test cases and 50,118 step-level labels, spanning four scientific disciplines with diverse difficulty levels and multi-modal content. In ProJudgeBench, each step is meticulously annotated by human experts for correctness, error type, and explanation, enabling a systematic evaluation of judges' capabilities to detect, classify, and diagnose errors. Evaluation on ProJudgeBench reveals a significant performance gap between open-source and proprietary models. To bridge this gap, we further propose ProJudge-173k, a large-scale instruction-tuning dataset, and a Dynamic Dual-Phase fine-tuning strategy that encourages models to explicitly reason through problem-solving before assessing solutions. Both contributions significantly enhance the process evaluation capabilities of open-source models. All resources will be released to foster future research on reliable multi-modal process evaluation.
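To make the step-level annotation scheme concrete, the following is a minimal sketch of how one annotated step and two simple agreement scores might be represented. It is an illustration under stated assumptions, not the authors' implementation: the `StepLabel` class, its field names, the example error-type strings, and both scoring functions are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class StepLabel:
    """One step-level annotation. Field names are illustrative assumptions."""
    is_correct: bool
    error_type: Optional[str] = None  # None when the step is judged correct
    explanation: str = ""


def step_accuracy(gold: Sequence[StepLabel], pred: Sequence[StepLabel]) -> float:
    """Fraction of steps where the judge's correctness verdict matches the human label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same steps")
    if not gold:
        return 0.0
    hits = sum(g.is_correct == p.is_correct for g, p in zip(gold, pred))
    return hits / len(gold)


def error_type_accuracy(gold: Sequence[StepLabel], pred: Sequence[StepLabel]) -> float:
    """Among steps the human marked erroneous, fraction where the judge
    also flags an error and names the same error type."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same steps")
    wrong = [(g, p) for g, p in zip(gold, pred) if not g.is_correct]
    if not wrong:
        return 0.0
    hits = sum((not p.is_correct) and p.error_type == g.error_type for g, p in wrong)
    return hits / len(wrong)


# Hypothetical three-step solution; the error-type strings are placeholders.
gold = [
    StepLabel(True),
    StepLabel(False, "calculation", "Sign error when expanding the product."),
    StepLabel(False, "reasoning", "Conclusion does not follow from step 2."),
]
pred = [
    StepLabel(True),
    StepLabel(False, "calculation", "Arithmetic slip."),
    StepLabel(True),  # the judge misses the third step's error
]

print(step_accuracy(gold, pred))
print(error_type_accuracy(gold, pred))
```

In this toy case the judge agrees with the human verdict on two of three steps and names the right error type for one of the two erroneous steps. Scoring explanations would need a separate comparison that this sketch does not attempt.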

Paper Structure

This paper contains 41 sections, 2 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Task definition of process evaluation. For each step, MLLM-based process judges detect errors, classify error types and provide brief explanations. Based on these analyses, we derive insights into model weaknesses, guiding future improvements.
  • Figure 2: An overview of the data construction process of ProJudgeBench and ProJudge-173k.
  • Figure 3: Distribution of error types across different disciplines and difficulty levels in ProJudgeBench. K12 and Comp represent routine and competition-level problems, respectively.
  • Figure 4: Performance of MLLM-based process judges across different disciplines, difficulty levels and modalities.
  • Figure 5: Model-as-Judge Performance across different models. Each position (x, y) in the heatmap represents the accuracy of model x as a judge in assessing solutions generated by model y.
  • ...and 2 more figures