ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs

Zitong Xu, Huiyu Duan, Shengyao Qin, Guangyu Yao, Guangji Ma, Xiongkuo Min, Ke Gu, Guangtao Zhai, Patrick Le Callet

Abstract

Recent advances in multimodal large language models (MLLMs) have greatly improved image understanding and captioning capabilities. However, existing image captioning benchmarks typically suffer from limited diversity in caption length, the absence of recent advanced MLLMs, and insufficient human annotations, which can introduce bias and limit the ability to comprehensively assess the performance of modern MLLMs. To address these limitations, we present a new large-scale image captioning benchmark, termed ICBench, which covers 12 content categories and consists of both short and long captions generated by 10 advanced MLLMs on 2K images, resulting in 40K captions in total. We conduct extensive human subjective studies to obtain mean opinion scores (MOSs) across fine-grained evaluation dimensions: short captions are assessed in terms of fluency, relevance, and conciseness, while long captions are assessed in terms of fluency, relevance, and completeness. Furthermore, we propose an automated evaluation metric, ITIScore, based on an image-to-text-to-image framework, which measures caption quality through reconstruction consistency. Experimental results demonstrate strong alignment between our automatic metric and human judgments, as well as robust zero-shot generalization to other public captioning datasets. Both the dataset and model will be released upon publication.

Paper Structure

This paper contains 19 sections, 9 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: MOS distributions of short and long captions across the evaluation dimensions.
  • Figure 2: Performance comparison of different MLLMs across image content categories: short captioning in terms of fluency, relevance, and conciseness, and long captioning in terms of fluency, relevance, and completeness.
  • Figure 3: Overview of our ITIScore. Given an image and its caption, a pretrained generative model reconstructs an image from the caption. The original image, reconstructed image, and caption are jointly fed into a multimodal large language model to obtain a unified representation. A lightweight MLP scoring head then predicts the mean score and uncertainty for each evaluation dimension, enabling uncertainty-aware multi-dimensional caption quality assessment.
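
The scoring head described in the Figure 3 caption can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the layer sizes, the random initialisation, the log-variance parameterisation of uncertainty, and the function names `init_head` and `score_head` are all assumptions made for illustration, and the input vector stands in for the unified representation that the MLLM would produce from the original image, the reconstructed image, and the caption.

```python
import math
import random

# Long-caption dimensions named in the abstract.
DIMENSIONS = ("fluency", "relevance", "completeness")


def init_head(feature_dim, hidden_dim, n_dims, seed=0):
    """Randomly initialise a two-layer MLP (hypothetical sizes).

    The output layer has 2 * n_dims units: one mean and one
    log-variance per evaluation dimension (an assumed parameterisation).
    """
    rng = random.Random(seed)

    def matrix(rows, cols):
        scale = 1.0 / math.sqrt(cols)
        return [[rng.uniform(-scale, scale) for _ in range(cols)]
                for _ in range(rows)]

    return {
        "w1": matrix(hidden_dim, feature_dim),
        "b1": [0.0] * hidden_dim,
        "w2": matrix(2 * n_dims, hidden_dim),
        "b2": [0.0] * (2 * n_dims),
    }


def linear(weights, bias, x):
    """Dense layer: weights @ x + bias."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def score_head(params, features):
    """Map a fused feature vector to (mean, std) per dimension."""
    hidden = [max(0.0, h)  # ReLU
              for h in linear(params["w1"], params["b1"], features)]
    out = linear(params["w2"], params["b2"], hidden)
    n_dims = len(out) // 2
    scores = {}
    for i, name in enumerate(DIMENSIONS[:n_dims]):
        mean, log_var = out[i], out[n_dims + i]
        # exp(0.5 * log_var) keeps the predicted std strictly positive.
        scores[name] = (mean, math.exp(0.5 * log_var))
    return scores


# Placeholder for the MLLM's unified representation (assumed 8-d here).
features = [0.1 * i for i in range(8)]
params = init_head(feature_dim=8, hidden_dim=16, n_dims=len(DIMENSIONS))
for name, (mean, std) in score_head(params, features).items():
    print(f"{name}: mean={mean:.3f}, std={std:.3f}")
```

Predicting a log-variance rather than a raw variance is one common way to keep the uncertainty positive without a clamp. Whether ITIScore does this, and which loss it trains the head with, is not stated in this excerpt.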