ProBench: Judging Multimodal Foundation Models on Open-ended Multi-domain Expert Tasks
Yan Yang, Dongxu Li, Haoning Wu, Bei Chen, Liu Liu, Liyuan Pan, Junnan Li
TL;DR
ProBench is an automatic, open-ended benchmark for evaluating multimodal foundation models on expert tasks. It aggregates 4,000 queries across 10 fields and 56 sub-fields, covers 17 languages and conversations of up to 13 turns, and benchmarks 24 MLLMs using MLLM-as-a-Judge, with an Elo-style leaderboard augmented by a Bradley-Terry model for style bias control ($\alpha=400$, $K=32$). Strong open-source models can rival proprietary ones, yet the benchmark exposes persistent challenges in visual perception, domain knowledge, and long-context reasoning, especially in multilingual and multi-round settings. Human agreement with the judge reaches 79.9%. A distilled local evaluator based on Llama-3.2-11B-Vision-Instruct is proposed to enable cost-efficient, private evaluation. Overall, ProBench provides a framework for aligning MLLM development with high-value professional tasks and real-world decision-making.
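To make the Elo-style rating concrete, the following is a minimal sketch of a standard Elo update from a single pairwise judgment. It assumes that $\alpha=400$ is the rating scale and $K=32$ the update step, as in the conventional Elo formulation; the function names, the starting rating of 1000, and the score encoding (1 for a win, 0.5 for a tie, 0 for a loss) are illustrative assumptions, not details taken from ProBench. The Bradley-Terry style control is not shown.

```python
def expected_score(rating_a: float, rating_b: float, alpha: float = 400.0) -> float:
    """Probability that model A beats model B under the standard Elo logistic curve.

    alpha is assumed to be the rating scale (400 in the conventional formulation).
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / alpha))


def elo_update(
    rating_a: float,
    rating_b: float,
    score_a: float,
    alpha: float = 400.0,
    k: float = 32.0,
) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one judged comparison.

    score_a is 1.0 if the judge prefers A, 0.0 if it prefers B, 0.5 for a tie
    (a hypothetical encoding chosen for this sketch).
    """
    e_a = expected_score(rating_a, rating_b, alpha)
    delta = k * (score_a - e_a)
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two models start at a hypothetical rating of 1000; the judge prefers A.
new_a, new_b = elo_update(1000.0, 1000.0, score_a=1.0)
print(new_a, new_b)  # 1016.0 984.0
```

With equal starting ratings the expected score is 0.5, so a win moves each rating by $K/2 = 16$ points. A larger $K$ makes the leaderboard react faster to individual judgments at the cost of more noise.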
Abstract
Solving expert-level multimodal tasks is a key milestone towards general intelligence. As the capabilities of multimodal large language models (MLLMs) continue to improve, evaluating such advanced multimodal intelligence becomes both necessary and challenging. In this work, we introduce ProBench, a benchmark of open-ended user queries that require professional expertise and advanced reasoning. ProBench consists of 4,000 high-quality samples independently submitted by professionals based on their daily productivity demands. It spans 10 fields and 56 sub-fields, including science, arts, humanities, coding, mathematics, and creative writing. Experimentally, we evaluate and compare 24 of the latest models using MLLM-as-a-Judge. Our results reveal that although the best open-source models rival the proprietary ones, ProBench presents significant challenges in visual perception, textual understanding, domain knowledge, and advanced reasoning, thus providing valuable directions for future multimodal AI research.
