ProJudge: A Multi-Modal Multi-Discipline Benchmark and Instruction-Tuning Dataset for MLLM-based Process Judges

Jiaxin Ai; Pengfei Zhou; Zhaopan Xu; Ming Li; Fanrui Zhang; Zizhen Li; Jianwen Sun; Yukang Feng; Baojin Huang; Zhongyuan Wang; Kaipeng Zhang

TL;DR

ProJudgeBench delivers a first-of-its-kind, multi-modal, multi-discipline benchmark with 2,400 test cases and 50,118 step-level labels to evaluate MLLM-based process judges on per-step correctness, error-type classification, and explanations. To close the reliability gap between open-source and proprietary models, the authors introduce ProJudge-173k, a large instruction-tuning dataset, and Dynamic Dual-Phase fine-tuning that separates reasoning from evaluation to improve robustness. Across extensive experiments, larger models excel as judges, fine-tuning substantially boosts open-source performance, and the Dynamic Dual-Phase approach narrows the gap while improving generalization to unseen problems. The work provides practical benchmarks and training strategies to advance trustworthy, multi-modal process evaluation in scientific reasoning.

Abstract

As multi-modal large language models (MLLMs) frequently exhibit errors when solving scientific problems, evaluating the validity of their reasoning processes is critical for ensuring reliability and uncovering fine-grained model weaknesses. Since human evaluation is laborious and costly, prompting MLLMs as automated process judges has become a common practice. However, the reliability of these model-based judges remains uncertain. To address this, we introduce ProJudgeBench, the first comprehensive benchmark specifically designed for evaluating the abilities of MLLM-based process judges. ProJudgeBench comprises 2,400 test cases and 50,118 step-level labels, spanning four scientific disciplines with diverse difficulty levels and multi-modal content. In ProJudgeBench, each step is meticulously annotated by human experts for correctness, error type, and explanation, enabling a systematic evaluation of judges' capabilities to detect, classify, and diagnose errors. Evaluation on ProJudgeBench reveals a significant performance gap between open-source and proprietary models. To bridge this gap, we further propose ProJudge-173k, a large-scale instruction-tuning dataset, and a Dynamic Dual-Phase fine-tuning strategy that encourages models to explicitly reason through problem-solving before assessing solutions. Both contributions significantly enhance the process evaluation capabilities of open-source models. All resources will be released to foster future research on reliable multi-modal process evaluation.
