Multi-modal, Multi-task, Multi-criteria Automatic Evaluation with Vision Language Models

Masanari Ohi, Masahiro Kaneko, Naoaki Okazaki, Nakamasa Inoue

TL;DR

This work addresses the challenge of evaluating text generated by vision-language models (VLMs) across multiple tasks by introducing HarmonicEval, a reference-free metric that aggregates criterion-wise scores into a single overall score. The method prompts a VLM to score the text on each of five criteria and then combines these scores via harmonic weighting, which accounts for the reliability of each criterion. The authors also introduce MMHE, a meta-evaluation dataset with 18,000 expert human judgments across four multi-modal tasks and five criteria, which supports analysis of how well metrics generalize across tasks and which criteria each task prioritizes. Empirical results show that HarmonicEval achieves higher correlations with human judgments than conventional metrics on MMHE and maintains strong performance on standard image captioning benchmarks, while its criterion-level scores make the overall score interpretable.

Abstract

Vision-language models (VLMs) have shown impressive abilities across a range of multi-modal tasks. However, existing metrics for evaluating the quality of text generated by VLMs typically focus on an overall evaluation for a specific task, such as image captioning. While the overall evaluation is essential for any task, the criteria prioritized can differ depending on the task, making it challenging for current metrics to adapt to multi-task scenarios. To address this limitation, we propose HarmonicEval, a reference-free comprehensive evaluation metric that aggregates criterion-wise scores to produce the overall score in a bottom-up manner. Furthermore, we construct the Multi-task Multi-criteria Human Evaluation (MMHE) dataset, which comprises 18,000 expert human judgments across four multi-modal tasks. Our experiments demonstrate that HarmonicEval achieves higher correlations with human judgments than conventional metrics while providing numerical scores for each criterion.
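
The meta-evaluation idea in the abstract, checking how closely a metric's scores track human judgments, can be illustrated with a rank correlation. The sketch below is only an illustration: the abstract does not say which correlation coefficient is used, so Kendall's tau-a is an assumption here, and the function name `kendall_tau_a` and all scores are hypothetical toy values, not data from MMHE.

```python
from itertools import combinations


def kendall_tau_a(metric_scores, human_scores):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(metric_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    n = len(metric_scores)
    if n < 2:
        raise ValueError("need at least two items to compare")

    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        # Positive product: both lists order items i and j the same way.
        product = (metric_scores[i] - metric_scores[j]) * (
            human_scores[i] - human_scores[j]
        )
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical toy data: five candidate texts with human ratings
# and the scores assigned by two made-up metrics.
human = [4.0, 2.0, 5.0, 1.0, 3.0]
metric_a = [3.8, 2.5, 4.6, 1.2, 2.9]  # ranks the texts like the humans do
metric_b = [3.0, 3.1, 2.9, 3.2, 3.0]  # ranks them almost in reverse

tau_a = kendall_tau_a(metric_a, human)
tau_b = kendall_tau_a(metric_b, human)
print(f"metric_a tau = {tau_a:.2f}, metric_b tau = {tau_b:.2f}")
```

A metric whose tau is closer to 1 orders the candidate texts more like the human annotators do, which is the sense in which one metric "achieves higher correlations" than another.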

Paper Structure

This paper contains 72 sections, 10 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Multi-task and multi-criteria evaluation. (a) The conventional single-criterion approach focuses on a single task, such as image captioning. (b) HarmonicEval integrates multiple criteria to provide overall scores. The MMHE dataset consists of 18,000 expert human judgments across four multi-modal tasks and five criteria.
  • Figure 2: The HarmonicEval framework consists of two steps. (a) Criterion-wise scoring is performed by prompting a VLM to evaluate the input text based on each criterion, followed by score smoothing to improve robustness based on the first-order statistics. (b) Score aggregation produces an overall score using harmonic weighting based on the second-order statistics, aiming to reduce statistical fluctuations.
  • Figure 3: The MMHE dataset is a multi-task multi-criteria human evaluation dataset. Each candidate text is manually evaluated by three expert annotators.
  • Figure 4: Human judgment score distributions for each task and criterion on the MMHE dataset.
  • Figure 5: Qualitative examples.
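
The two steps in the Figure 2 caption can be sketched in code. This is a rough illustration under stated assumptions, not the paper's formulation: "score smoothing" is assumed to mean taking the expected rating under the VLM's probability distribution over rating tokens (a first-order statistic), and "harmonic weighting" is assumed to mean weighting each criterion inversely to the standard deviation of that distribution (derived from a second-order statistic). The function names, the criterion names `c1` to `c3`, and all probabilities are hypothetical.

```python
import math


def smooth_score(rating_probs):
    """Return (expected rating, variance) for one criterion.

    rating_probs maps each possible rating (e.g. 1..5) to the probability
    the model assigns to it. Probabilities are renormalized to sum to 1.
    """
    total = sum(rating_probs.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    mean = sum(r * p for r, p in rating_probs.items()) / total
    var = sum(p * (r - mean) ** 2 for r, p in rating_probs.items()) / total
    return mean, var


def aggregate(criterion_stats, eps=1e-6):
    """Combine criterion-wise scores into one overall score.

    ASSUMPTION: weights are proportional to 1 / standard deviation, so
    criteria with less spread in their rating distribution count more.
    eps avoids division by zero when a distribution has zero variance.
    """
    raw = {
        name: 1.0 / (math.sqrt(var) + eps)
        for name, (_, var) in criterion_stats.items()
    }
    norm = sum(raw.values())
    weights = {name: w / norm for name, w in raw.items()}
    overall = sum(weights[name] * criterion_stats[name][0] for name in weights)
    return overall, weights


# Hypothetical rating distributions for three placeholder criteria.
stats = {
    "c1": smooth_score({4: 0.5, 5: 0.5}),           # confident, high rating
    "c2": smooth_score({1: 0.25, 3: 0.5, 5: 0.25}), # widely spread ratings
    "c3": smooth_score({2: 0.1, 3: 0.8, 4: 0.1}),   # confident, middle rating
}
overall, weights = aggregate(stats)
print(f"overall = {overall:.3f}")
for name, w in weights.items():
    print(f"{name}: score = {stats[name][0]:.2f}, weight = {w:.3f}")
```

Under this assumed weighting, a criterion whose rating distribution is widely spread contributes less to the overall score than one the model rates with a sharp distribution, which is one way to read "aiming to reduce statistical fluctuations".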