Multi-modal, Multi-task, Multi-criteria Automatic Evaluation with Vision Language Models
Masanari Ohi, Masahiro Kaneko, Naoaki Okazaki, Nakamasa Inoue
TL;DR
This work addresses the challenge of evaluating text generated by vision-language models (VLMs) across multiple tasks by introducing HarmonicEval, a reference-free metric that aggregates criterion-wise scores into a single overall score. The method uses criterion-wise prompts to assess text quality on five criteria and then combines these scores via harmonic weighting that accounts for the reliability of each criterion. The authors also introduce MMHE, a large meta-evaluation dataset with 18,000 expert human judgments across four multi-modal tasks and five criteria, which supports analysis of how well metrics generalize across tasks and which criteria each task prioritizes. Empirical results show that HarmonicEval achieves higher correlations with human judgments than conventional metrics on MMHE and maintains strong performance on standard image captioning benchmarks, while providing interpretable, criterion-level scores. Overall, HarmonicEval offers an explainable, task-agnostic way to evaluate VLM outputs, and MMHE enables deeper analysis of how metrics align with human judgments across diverse settings.
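The bottom-up aggregation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that a criterion's reliability can be approximated by the inverse variance of several sampled scores for that criterion. The criterion names, the `eps` smoothing term, and the functions `reliability_weights` and `overall_score` are all hypothetical.

```python
from statistics import pvariance


def reliability_weights(score_samples, eps=1e-6):
    """Return normalized inverse-variance weights per criterion.

    ASSUMPTION: reliability is modeled as 1 / variance of the sampled
    scores. The source only says the weighting "accounts for the
    reliability of each criterion"; the exact formula is not given here.
    """
    inv_var = {c: 1.0 / (pvariance(s) + eps) for c, s in score_samples.items()}
    total = sum(inv_var.values())
    return {c: v / total for c, v in inv_var.items()}


def overall_score(score_samples):
    """Combine criterion-wise mean scores into one overall score."""
    weights = reliability_weights(score_samples)
    means = {c: sum(s) / len(s) for c, s in score_samples.items()}
    return sum(weights[c] * means[c] for c in score_samples)


# Hypothetical criterion names and sampled 1-5 scores for one generated text.
samples = {
    "criterion_a": [4, 4, 5],  # samples mostly agree: higher weight
    "criterion_b": [1, 5, 3],  # samples disagree: lower weight
    "criterion_c": [3, 4, 3],
}
weights = reliability_weights(samples)
score = overall_score(samples)
```

Because the weights are normalized, the overall score stays within the range of the criterion-wise means, and the per-criterion means remain available as diagnostics.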
Abstract
Vision-language models (VLMs) have shown impressive abilities across a range of multi-modal tasks. However, existing metrics for evaluating the quality of text generated by VLMs typically focus on an overall evaluation for a specific task, such as image captioning. While the overall evaluation is essential for any task, the criteria prioritized can differ depending on the task, making it challenging for current metrics to adapt to multi-task scenarios. To address this limitation, we propose HarmonicEval, a reference-free comprehensive evaluation metric that aggregates criterion-wise scores to produce the overall score in a bottom-up manner. Furthermore, we construct the Multi-task Multi-criteria Human Evaluation (MMHE) dataset, which comprises 18,000 expert human judgments across four multi-modal tasks. Our experiments demonstrate that HarmonicEval achieves higher correlations with human judgments than conventional metrics while providing numerical scores for each criterion.
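Agreement between a metric and human judgments is commonly measured with a rank correlation. The sketch below assumes Kendall's tau-b as the correlation measure. This excerpt does not state which coefficient the authors report, and the function name `kendall_tau_b` and the example scores are hypothetical.

```python
from itertools import combinations


def kendall_tau_b(metric_scores, human_scores):
    """Kendall's tau-b between two equal-length score lists (handles ties)."""
    concordant = discordant = ties_x = ties_y = 0
    pairs = combinations(zip(metric_scores, human_scores), 2)
    for (x1, y1), (x2, y2) in pairs:
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both lists: excluded from both denominator factors
        elif dx == 0:
            ties_x += 1  # tied only in the metric scores
        elif dy == 0:
            ties_y += 1  # tied only in the human scores
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = ((untied + ties_x) * (untied + ties_y)) ** 0.5
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical metric outputs and human ratings for five generated texts.
metric = [0.9, 0.4, 0.7, 0.2, 0.6]
human = [5, 2, 4, 1, 4]
tau = kendall_tau_b(metric, human)
```

A value near 1 means the metric ranks outputs in nearly the same order as the human annotators, while a value near 0 indicates little agreement.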
