IRR: Image Review Ranking Framework for Evaluating Vision-Language Models
Kazuki Hayashi, Kazuma Onishi, Toma Suzuki, Yusuke Ide, Seiji Gobara, Shigeki Saito, Yusuke Sakai, Hidetaka Kamigaito, Katsuhiko Hayashi, Taro Watanabe
TL;DR
IRR introduces an image-review ranking framework to evaluate vision-language systems on multi-perspective critiques of images, moving beyond single-reference factual evaluations. The approach uses a perplexity-based ranking task for each image's five generated reviews and compares model rankings to human judgments via Spearman correlation, enabling assessment of alignment with human reasoning. A curated dataset of 207 Wikipedia-derived images across 15 categories with English and Japanese reviews generated by GPT-4V supports cross-language analysis, revealing that current LVLMs achieve only moderate, imperfect agreement with humans and CLIP-based metrics are inadequate for this task. The work highlights the value of integrating inferential capabilities of LLMs with visual information for better alignment with human perspectives and underscores limitations and directions for improving cross-lingual, multi-perspective evaluation in vision-language tasks.
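The ranking-and-correlation procedure summarized above can be illustrated with a minimal sketch. It assumes that per-token log-probabilities for each of an image's five reviews have already been obtained from an LVLM; the values below are made up for illustration, and the helper names (`perplexity`, `ranks_from_scores`, `spearman`) are hypothetical rather than taken from the authors' code. The paper's exact prompting and perplexity computation may differ.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def ranks_from_scores(scores):
    """Assign rank 1 to the lowest score (ties are not handled in this sketch)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

def spearman(rank_a, rank_b):
    """Spearman's rho for two tie-free rankings of the same items."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical log-probabilities for the five reviews of one image
# (in practice these would come from an LVLM conditioned on the image).
review_logprobs = [
    [-0.2, -0.3, -0.1],
    [-1.0, -1.2, -0.8],
    [-0.5, -0.4, -0.6],
    [-2.0, -1.5, -2.5],
    [-0.9, -0.7, -0.8],
]
# Hypothetical human ranking of the same five reviews (1 = best).
human_ranks = [2, 4, 1, 5, 3]

# Lower perplexity is treated as a better (higher-ranked) review.
model_ranks = ranks_from_scores([perplexity(lp) for lp in review_logprobs])
rho = spearman(model_ranks, human_ranks)
print(model_ranks, rho)
```

A rho close to 1 would indicate that the model's perplexity-based ordering closely follows the human ordering for that image; averaging rho over all images gives a dataset-level score under these assumptions.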
Abstract
Large-scale Vision-Language Models (LVLMs) process both images and text and excel at multimodal tasks such as image captioning and description generation. However, while these models are strong at generating factual content, their ability to generate and evaluate texts that reflect different perspectives on the same image, depending on the context, has not been sufficiently explored. To address this, we propose IRR: Image Review Rank, a novel evaluation framework designed to assess critic review texts from multiple perspectives. IRR evaluates LVLMs by measuring how closely their judgments align with human interpretations. We validate it using a dataset of images from 15 categories, each with five critic review texts and annotated rankings in both English and Japanese, totaling over 2,000 data instances. The datasets are available at https://hf.co/datasets/naist-nlp/Wiki-ImageReview1.0. Our results indicate that, although LVLMs exhibited consistent performance across languages, their correlation with human annotations was insufficient. These findings highlight the limitations of current evaluation methods and the need for approaches that better capture human reasoning in Vision & Language tasks.
