UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation

Qihui Zhang, Munan Ning, Zheyuan Liu, Yanbo Wang, Jiayi Ye, Yue Huang, Shuo Yang, Xiao Chen, Yibing Song, Li Yuan

TL;DR

UPME introduces the first unsupervised peer-review framework for evaluating multimodal large language models using only image data to generate questions and a vision-language scoring system to assess answers across correctness, visual understanding, and image-text alignment. A dynamic weight optimization scheme aligns the unsupervised scores with human benchmarks, achieving high Pearson and Spearman correlations (MMStar: 0.944/0.972; ScienceQA: 0.814/0.886) and demonstrating reduced verbosity and self-preference biases relative to traditional MLLM-as-a-judge methods. The approach scales evaluation by minimizing human annotation, improves alignment with human preferences, and remains robust to hyperparameter variations and dataset diversity. These contributions offer a practical, scalable, and bias-mitigating pathway for objective MLLM evaluation in multimodal settings.

Abstract

Multimodal Large Language Models (MLLMs) have emerged to tackle the challenges of Visual Question Answering (VQA), sparking a new research focus on conducting objective evaluations of these models. Existing evaluation methods are limited by the significant human workload required to design Q&A pairs for visual images, which inherently restricts the scale and scope of evaluations. Although automated MLLM-as-judge approaches attempt to reduce this workload through automatic evaluation, they often introduce biases. To address these problems, we propose an Unsupervised Peer review MLLM Evaluation (UPME) framework. It uses only image data, allowing models to automatically generate questions and conduct peer review assessments of answers from other models, which reduces the reliance on human workload. Additionally, we introduce a vision-language scoring system to mitigate the bias issues, which focuses on three aspects: (i) response correctness; (ii) visual understanding and reasoning; and (iii) image-text correlation. Experimental results demonstrate that UPME achieves a Pearson correlation of 0.944 with human evaluations on the MMStar dataset and 0.814 on the ScienceQA dataset, indicating that our framework closely aligns with human-designed benchmarks and inherent human preferences.
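The reported alignment figures are Pearson and Spearman correlations between the framework's scores and human-derived benchmark results. As a rough illustration of how such an agreement check might be computed, the sketch below implements both coefficients in plain Python. The function names and the sample numbers are hypothetical and are not taken from the paper.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: unsupervised scores for five models and the
# accuracies the same models obtain on a human-annotated benchmark.
framework_scores = [0.82, 0.75, 0.61, 0.58, 0.40]
human_accuracy = [0.79, 0.77, 0.60, 0.52, 0.45]

print("Pearson :", round(pearson(framework_scores, human_accuracy), 3))
print("Spearman:", round(spearman(framework_scores, human_accuracy), 3))
```

Pearson measures how linearly the two score lists track each other, while Spearman only checks whether the model ranking is preserved, which is why the two values can differ on the same data.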

Paper Structure

This paper contains 40 sections, 10 equations, 11 figures, 8 tables, 1 algorithm.

Figures (11)

  • Figure 1: Existing methods for evaluating MLLMs face various challenges. Our proposed UPME framework addresses these limitations by leveraging a peer review mechanism, reducing annotation costs, and aligning closely with human judgment.
  • Figure 2: The UPME framework consists of three main components: $(i)$ Peer Review Mechanism, where two candidate models and one review model are randomly selected from the MLLM pool. The review model generates questions based on a selected image, and candidate models provide responses. $(ii)$ Vision-Language Judgment Scoring System, which evaluates answers based on textual correctness, visual understanding and reasoning, and image-text correlation. $(iii)$ Dynamic Weight Optimization, ensuring consistency between confidence weights and estimated scores through iterative optimization cycles.
  • Figure 3: Convergence experiments.
  • Figure 4: The performance of UPME across different sample sizes.
  • Figure 5: Model accuracy comparison in the peer review framework with and without UPME, where Peer Review_Cor. represents the correctness of the original peer review, and UPME_Cor. and UPME_Vis. correspond to the two judgment dimensions of response correctness and visual understanding, introduced in Section 3.2.
  • ...and 6 more figures
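
The Figure 2 caption describes a loop in which a reviewer and two candidates are drawn at random from the model pool, the reviewer judges the candidates' answers about an image, and confidence weights are iteratively brought in line with the estimated scores. A toy simulation of that loop might look like the sketch below. Everything in it is an assumption made for illustration: the `abilities` table stands in for real MLLMs, the judgment is a noisy comparison rather than the vision-language scoring system, and the smoothed weight update is a simple placeholder, not the paper's optimization rule.

```python
import random

def peer_review(abilities, images, rounds=200, seed=0):
    """Toy peer-review loop over a pool of simulated models.

    abilities: hypothetical dict mapping model name -> skill in [0, 1].
    images:    list of image identifiers (the only "data" the loop needs).
    """
    rng = random.Random(seed)
    models = list(abilities)
    weights = {m: 1.0 / len(models) for m in models}  # start with equal trust
    scores = {m: 0.0 for m in models}
    log = []

    for _ in range(rounds):
        image = rng.choice(images)
        # One reviewer and two distinct candidates from the pool.
        reviewer, cand_a, cand_b = rng.sample(models, 3)

        # Stub for "reviewer asks a question, candidates answer":
        # answer quality is the candidate's skill plus noise.
        quality_a = abilities[cand_a] + rng.gauss(0, 0.1)
        quality_b = abilities[cand_b] + rng.gauss(0, 0.1)
        better, worse = (cand_a, cand_b) if quality_a >= quality_b else (cand_b, cand_a)

        # Stub for the judgment: a stronger reviewer picks the better
        # answer more often.
        verdict = better if rng.random() < abilities[reviewer] else worse

        # The preferred candidate gains credit scaled by reviewer trust.
        scores[verdict] += weights[reviewer]

        # Placeholder update (assumption): weights follow smoothed scores,
        # so models that win more reviews are trusted more as reviewers.
        total = sum(scores.values())
        weights = {m: (scores[m] + 1.0) / (total + len(models)) for m in models}
        log.append((image, reviewer, cand_a, cand_b, verdict))

    return scores, weights, log

# Hypothetical pool of four models and three unlabeled images.
pool = {"model_a": 0.9, "model_b": 0.7, "model_c": 0.5, "model_d": 0.3}
scores, weights, log = peer_review(pool, ["img_1", "img_2", "img_3"], rounds=100)
for name in sorted(weights, key=weights.get, reverse=True):
    print(name, round(weights[name], 3))
```

The point of the sketch is the feedback loop: review outcomes change the scores, the scores change how much each reviewer's future verdicts count, and no human-written question or label enters at any step.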