Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following

Tianyi Xiong, Yi Ge, Ming Li, Zuolong Zhang, Pranav Kulkarni, Kaishen Wang, Qi He, Zeying Zhu, Chenxi Liu, Ruibo Chen, Tong Zheng, Yanshuo Chen, Xiyao Wang, Renrui Zhang, Wenhu Chen, Heng Huang

TL;DR

Multi-Crit is a benchmark for evaluating multimodal judges on pluralistic, criterion-level judgments across open-ended generation and verifiable reasoning tasks. It pairs a curated dataset of challenging response pairs, annotated by humans on multiple criteria, with three metrics ($\text{PAcc}$, $\text{TOS}$, $\text{CMR}$) that assess adherence to diverse criteria, trade-off awareness, and conflict resolution. Across 25 LMMs, the results show that proprietary models struggle to adhere consistently to pluralistic criteria, especially in open-ended tasks; open-source models lag further behind in criterion-following; and critic fine-tuning mainly improves visual grounding without generalizing to criterion-level judgment. The work also analyzes reasoning fine-tuning, test-time scaling, and boundary consistency between open-source and proprietary models, positioning Multi-Crit as a foundation for building reliable, steerable multimodal evaluation systems.

Abstract

Large multimodal models (LMMs) are increasingly adopted as judges in multimodal evaluation systems due to their strong instruction following and consistency with human preferences. However, their ability to follow diverse, fine-grained evaluation criteria remains underexplored. We develop Multi-Crit, a benchmark for evaluating multimodal judges on their capacity to follow pluralistic criteria and produce reliable criterion-level judgments. Covering both open-ended generation and verifiable reasoning tasks, Multi-Crit is built through a rigorous data curation pipeline that gathers challenging response pairs with multi-criterion human annotations. It further introduces three novel metrics for systematically assessing pluralistic adherence, criterion-switching flexibility, and the ability to recognize criterion-level preference conflicts. Comprehensive analysis of 25 LMMs reveals that 1) proprietary models still struggle to maintain consistent adherence to pluralistic criteria, especially in open-ended evaluation; 2) open-source models lag further behind in flexibly following diverse criteria; and 3) critic fine-tuning with holistic judgment signals enhances visual grounding but fails to generalize to pluralistic criterion-level judgment. Additional analyses on reasoning fine-tuning, test-time scaling, and boundary consistency between open-source and proprietary models further probe the limits of current multimodal judges. As a pioneering study, Multi-Crit lays the foundation for building reliable and steerable multimodal AI evaluation.
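
The abstract names three metric targets (pluralistic adherence, criterion-switching flexibility, and recognition of criterion-level preference conflicts) without giving formulas. The sketch below is one plausible reading, not the paper's definitions: it assumes each sample carries a per-criterion preference ("A" or "B") from both humans and the judge. The function names, the criterion names, and the exact scoring rules are all illustrative assumptions.

```python
from itertools import combinations
from typing import Dict, List

# Assumed data shape: criterion name -> preferred response ("A" or "B").
Prefs = Dict[str, str]


def has_conflict(prefs: Prefs) -> bool:
    """True when the criteria do not all prefer the same response."""
    return len(set(prefs.values())) > 1


def pluralistic_accuracy(human: List[Prefs], judge: List[Prefs]) -> float:
    """Assumed rule: a sample counts only if the judge matches every criterion."""
    hits = sum(
        all(j.get(c) == p for c, p in h.items())
        for h, j in zip(human, judge)
    )
    return hits / len(human) if human else 0.0


def tradeoff_sensitivity(human: List[Prefs], judge: List[Prefs]) -> float:
    """Assumed rule: among samples where human criteria conflict, the share
    where the judge's own per-criterion preferences also diverge."""
    conflicted = [(h, j) for h, j in zip(human, judge) if has_conflict(h)]
    if not conflicted:
        return 0.0
    return sum(has_conflict(j) for _, j in conflicted) / len(conflicted)


def conflict_matching_rate(human: List[Prefs], judge: List[Prefs]) -> float:
    """Assumed rule: among criterion pairs with opposite human preferences,
    the share where the judge agrees with humans on both criteria."""
    total = matched = 0
    for h, j in zip(human, judge):
        for c1, c2 in combinations(h, 2):
            if h[c1] != h[c2]:
                total += 1
                matched += int(j.get(c1) == h[c1] and j.get(c2) == h[c2])
    return matched / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical criteria and labels, for illustration only.
    human = [{"accuracy": "A", "creativity": "B"},
             {"accuracy": "A", "creativity": "A"}]
    judge = [{"accuracy": "A", "creativity": "A"},
             {"accuracy": "A", "creativity": "A"}]
    print(pluralistic_accuracy(human, judge))
    print(tradeoff_sensitivity(human, judge))
    print(conflict_matching_rate(human, judge))
```

Under these assumed rules, a judge that gives the same verdict on every criterion can still score on samples without conflicts, but it scores zero on the two conflict-oriented measures. That is the kind of failure a holistic judge would show on pluralistic criteria.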

Paper Structure

This paper contains 50 sections, 4 equations, 10 figures, 19 tables.

Figures (10)

  • Figure 1: Data Construction Pipeline. Multi-Crit is built from diverse prompts across open-ended and reasoning tasks, responses from various LMMs reflecting subtle quality distinctions, and multi-criterion human annotations highlighting preference conflicts across criteria.
  • Figure 1: Multi-Crit's key statistics.
  • Figure 2: Distribution of prompt sources (left) and evaluation criteria (right).
  • Figure 3: Average performance across each criterion. While the top model differs across criteria, all models show stronger pluralistic alignment in verifiable reasoning than in open-ended tasks.
  • Figure 4: Results of RL-tuned reasoning models on the Multi-Crit reasoning split, all based on Qwen2.5-VL-7B.
  • ...and 5 more figures