IRR: Image Review Ranking Framework for Evaluating Vision-Language Models

Kazuki Hayashi, Kazuma Onishi, Toma Suzuki, Yusuke Ide, Seiji Gobara, Shigeki Saito, Yusuke Sakai, Hidetaka Kamigaito, Katsuhiko Hayashi, Taro Watanabe

TL;DR

IRR introduces an image-review ranking framework to evaluate vision-language systems on multi-perspective critiques of images, moving beyond single-reference factual evaluations. The approach uses a perplexity-based ranking task for each image's five generated reviews and compares model rankings to human judgments via Spearman correlation, enabling assessment of alignment with human reasoning. A curated dataset of 207 Wikipedia-derived images across 15 categories with English and Japanese reviews generated by GPT-4V supports cross-language analysis, revealing that current LVLMs achieve only moderate, imperfect agreement with humans and CLIP-based metrics are inadequate for this task. The work highlights the value of integrating inferential capabilities of LLMs with visual information for better alignment with human perspectives and underscores limitations and directions for improving cross-lingual, multi-perspective evaluation in vision-language tasks.
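The ranking-and-correlation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that a lower perplexity means the model prefers that review (rank 1 = lowest perplexity), that there are no tied ranks, and it uses made-up perplexity values and human ranks. The function names `ranks_from_scores` and `spearman` are hypothetical helpers introduced here for illustration.

```python
from typing import Sequence


def ranks_from_scores(scores: Sequence[float]) -> list[int]:
    """Turn scores into ranks, where rank 1 is the lowest score.

    Assumption: lower perplexity = review preferred by the model.
    Equal scores are ordered by position (no tie averaging).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def spearman(rank_a: Sequence[int], rank_b: Sequence[int]) -> float:
    """Spearman rank correlation for two rankings without ties."""
    n = len(rank_a)
    if n != len(rank_b) or n < 2:
        raise ValueError("rankings must have equal length >= 2")
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# Hypothetical perplexities for the five reviews of one image.
perplexities = [12.4, 8.1, 15.0, 9.7, 20.3]
# Hypothetical human ranking of the same five reviews (1 = best).
human_ranks = [3, 1, 5, 2, 4]

model_ranks = ranks_from_scores(perplexities)
rho = spearman(model_ranks, human_ranks)
print(model_ranks, round(rho, 3))
```

In this toy case the model and the human annotators disagree only on the order of the last two reviews, so the correlation is high but below 1. Averaging such per-image correlations over a dataset gives one plausible summary score; whether the framework aggregates exactly this way is not stated in this summary.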

Abstract

Large-scale Vision-Language Models (LVLMs) process both images and text, excelling in multimodal tasks such as image captioning and description generation. However, while these models excel at generating factual content, their ability to generate and evaluate texts that reflect different perspectives on the same image, depending on the context, has not been sufficiently explored. To address this, we propose IRR: Image Review Rank, a novel evaluation framework designed to assess critic review texts from multiple perspectives. IRR evaluates LVLMs by measuring how closely their judgments align with human interpretations. We validate it using a dataset of images from 15 categories, each with five critic review texts and annotated rankings in both English and Japanese, totaling over 2,000 data instances. The datasets are available at https://hf.co/datasets/naist-nlp/Wiki-ImageReview1.0. Our results indicate that, although LVLMs exhibited consistent performance across languages, their correlation with human annotations was insufficient, indicating that further advancements are needed. These findings highlight the limitations of current evaluation methods and the need for approaches that better capture human reasoning in Vision & Language tasks.

Paper Structure (41 sections, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Three different image-to-text generation tasks and their corresponding metrics.
  • Figure 2: Dataset Construction Process.
  • Figure 3: Correlation between prompt and human ranks.
  • Figure 4: Changes in the remaining data count and the average rank correlation as the threshold varies. The bar graphs represent the remaining data count and the line graphs denote the average rank correlation. NaN means no threshold.