Judge Anything: MLLM as a Judge Across Any Modality

Shu Pu, Yaochen Wang, Dongping Chen, Yuhang Chen, Guohao Wang, Qi Qin, Zhongyi Zhang, Zhiyuan Zhang, Zetong Zhou, Shuang Gong, Yi Gui, Yao Wan, Philip S. Yu

TL;DR

The paper tackles the problem of evaluating open-ended multimodal understanding (MMU) and generation (MMG) across diverse modalities by using Multimodal LLMs (MLLMs) as automated judges. It introduces two benchmarks, TaskAnything and JudgeAnything, which assess overall performance and judging capability, respectively, across any-to-any modality tasks. Both are built through a four-stage construction process and validated against human-annotated ground truth. The study finds that MLLMs align reasonably well with human judgments on MMU tasks but struggle on MMG tasks, revealing cross-modality biases and hallucinations. It also shows that structured rubrics and sample-wise checklists improve alignment in some settings while hindering it in others. To advance fair, scalable evaluation, the authors present OmniArena, a standardized platform for evaluating omni-models and multimodal reward models, and call for stronger cross-modal evaluation protocols that better reflect human preferences and real-world use cases.

Abstract

Evaluating generative foundation models on open-ended multimodal understanding (MMU) and generation (MMG) tasks across diverse modalities (e.g., images, audio, video) poses significant challenges due to the complexity of cross-modal interactions. To this end, the idea of utilizing Multimodal LLMs (MLLMs) as automated judges has emerged, with encouraging results in assessing vision-language understanding tasks. Moving further, this paper extends MLLM-as-a-Judge across modalities in a unified manner by introducing two benchmarks, TaskAnything and JudgeAnything, which respectively evaluate the overall performance and the judging capabilities of MLLMs across any-to-any modality tasks. Specifically, TaskAnything evaluates MMU and MMG capabilities across 15 any-to-any modality categories, employing 1,500 queries curated from well-established benchmarks. Furthermore, JudgeAnything evaluates the judging capabilities of 5 advanced MLLMs (e.g., GPT-4o and Gemini-2.0-Flash) from the perspectives of Pair Comparison and Score Evaluation, providing a standardized testbed that incorporates human judgments and detailed rubrics. Our extensive experiments reveal that while these MLLMs show promise in assessing MMU (i.e., achieving an average of 66.55% in the Pair Comparison setting and 42.79% in the Score Evaluation setting), they encounter significant challenges with MMG tasks (i.e., averaging only 53.37% in the Pair Comparison setting and 30.05% in the Score Evaluation setting), exposing cross-modality biases and hallucination issues. To address this, we present OmniArena, an automated platform for evaluating omni-models and multimodal reward models. Our work highlights the need for fairer evaluation protocols and stronger alignment with human preferences. The source code and dataset are publicly available at: https://urrealhero.github.io/judgeanythingweb/.
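
To make the two judging settings concrete, the following is a minimal sketch of how agreement between an MLLM judge and human annotators could be computed. It is an illustration under stated assumptions, not the authors' evaluation code: the abstract does not specify the exact metrics, so Pair Comparison is assumed here to be plain accuracy against the human choice, and Score Evaluation is assumed to be a Pearson correlation between judge and human scores. The function names and the toy data are hypothetical.

```python
from math import sqrt
from typing import Sequence


def pair_accuracy(judge: Sequence[str], human: Sequence[str]) -> float:
    """Fraction of pairs where the judge's choice matches the human choice.

    Assumption: each entry is a label such as "A", "B", or "tie".
    """
    if len(judge) != len(human) or not judge:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between judge scores and human scores.

    Assumption: this stands in for the Score Evaluation agreement metric.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data, not taken from the paper.
judge_choices = ["A", "B", "tie", "A"]
human_choices = ["A", "B", "A", "A"]
print(pair_accuracy(judge_choices, human_choices))  # 0.75

judge_scores = [1, 2, 3, 4, 5]
human_scores = [2, 4, 6, 8, 10]
print(pearson(judge_scores, human_scores))
```

Under these assumptions, a reported figure such as 66.55% would correspond to `pair_accuracy` averaged over tasks and judges, expressed as a percentage.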

Paper Structure

This paper contains 28 sections, 4 equations, 47 figures, 9 tables.

Figures (47)

  • Figure 1: The construction of TaskAnything and JudgeAnything follows a systematic four-step approach. First, we compile open-ended any-to-any instructions from existing benchmarks and datasets, followed by rigorous human annotation to ensure sample diversity and quality in TaskAnything. Subsequently, we collect model responses and develop evaluation principles through a Human-MLLM collaborative approach, creating detailed assessment checklists for each sample. Finally, we curate instruction-response pairs to evaluate the effectiveness of MLLM-as-a-Judge in any-to-any generation tasks, benchmarking these automated assessments against expert human judgments.
  • Figure 2: TaskAnything and JudgeAnything comprise 15 any-to-any combinations spanning text, image, video, and audio modalities. The TaskAnything samples are curated from established benchmarks, while responses to queries are generated using state-of-the-art models to construct JudgeAnything in both Pair Comparison and Score Evaluation settings.
  • Figure 3: Visualization of MMU and MMG categories with human agreement data. Left: Accuracy scores for the Pair Comparison setting across two categories. Right: Agreement scores for the Score Evaluation setting across two categories. The dotted line connects the same baseline from MMU to MMG to highlight the trend.
  • Figure 4: Visualization of modality effect on human agreement across two settings using checklist-of-thought.
  • Figure 5: Correlation heatmaps for different judges: OC (Overall Choice), Rel (Relevance), Tru (Trustworthiness), Cre (Creativity & Novelty), Cla (Clarity), Coh (Coherence), Com (Completeness).
  • ...and 42 more figures