Vision Language Model-based Caption Evaluation Method Leveraging Visual Context Extraction

Koki Maeda, Shuhei Kurita, Taiki Miyanishi, Naoaki Okazaki

TL;DR

VisCE$^2$ addresses the gap between automatic caption scores and human judgments by replacing human-written references with structured visual context and evaluating captions with a vision-language model (VLM). The method has two stages: first, a VLM extracts the visual context as a bullet list of objects, attributes, and relationships; second, a VLM is prompted to rate a candidate caption on a scale from $0$ to $100$, and postprocessing extracts the score from its response. Across THumB, Flickr8k-Expert, Composite, and Pascal-50S, VisCE$^2$ shows higher correlation with human judgments than traditional metrics, and results with GPT-4V indicate strong upper-bound performance when combined with VisCE$^2$. The improved fidelity to human preferences comes at a higher computational cost, and the method is sensitive to prompt wording, which leaves robustness as an open issue for broader adoption in VLM-based caption evaluation.
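
The two stages described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `vlm` callable interface, both prompt strings, the function names, and the rule of taking the first integer in $[0, 100]$ as the score are all assumptions made for the example.

```python
import re
from typing import Callable, Optional

# Hypothetical interface: a VLM is any callable (image_bytes, prompt) -> text.
VLM = Callable[[bytes, str], str]

# Illustrative prompt wording only; the paper's actual prompts are not given here.
CONTEXT_PROMPT = (
    "Describe the image as a bullet list covering objects, "
    "object attributes, and relationships between objects."
)
EVAL_PROMPT = (
    "Visual context:\n{context}\n\n"
    "Candidate caption: {caption}\n"
    "Rate the caption on a scale from 0 to 100."
)


def extract_visual_context(vlm: VLM, image: bytes) -> str:
    """Stage 1: ask the VLM for a structured, bullet-list visual context."""
    return vlm(image, CONTEXT_PROMPT).strip()


def parse_score(reply: str) -> Optional[int]:
    """Postprocessing (assumed rule): first integer in [0, 100], else None."""
    for match in re.finditer(r"\d+", reply):
        value = int(match.group())
        if 0 <= value <= 100:
            return value
    return None


def evaluate_caption(vlm: VLM, image: bytes, caption: str) -> Optional[int]:
    """Stage 2: rate the caption given the image and the extracted context."""
    context = extract_visual_context(vlm, image)
    reply = vlm(image, EVAL_PROMPT.format(context=context, caption=caption))
    return parse_score(reply)


def fake_vlm(image: bytes, prompt: str) -> str:
    """Stand-in model so the sketch runs without any real VLM."""
    if prompt == CONTEXT_PROMPT:
        return "- dog (brown)\n- grass (green)\n- dog is lying on grass"
    return "Score: 85"


score = evaluate_caption(fake_vlm, b"", "A brown dog lying on the grass.")
```

Returning `None` when no valid score is found keeps a malformed model reply from being silently counted as a rating; a real pipeline would have to decide how to handle such cases.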

Abstract

Given the accelerating progress of vision and language modeling, accurate evaluation of machine-generated image captions remains critical. To evaluate captions in closer agreement with human preferences, metrics need to discriminate between captions of varying quality and content. However, conventional metrics fall short of comparing captions beyond superficial word matches or embedding similarities; thus, they still need improvement. This paper presents VisCE$^2$, a vision language model-based caption evaluation method. Our method focuses on visual context, which refers to the detailed content of images, including objects, attributes, and relationships. By extracting and organizing them into a structured format, we replace the human-written references with visual contexts and help VLMs better understand the image, enhancing evaluation performance. Through meta-evaluation on multiple datasets, we validated that VisCE$^2$ outperforms the conventional pre-trained metrics in capturing caption quality and demonstrates superior consistency with human judgment.
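
Meta-evaluation of this kind compares a metric's scores against human ratings for the same captions. The sketch below uses Kendall's tau-a as one familiar rank correlation; the text above does not state which coefficient is used, so the choice of tau-a, the function name, and the toy numbers are assumptions for illustration only.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(metric: Sequence[float], human: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / total pairs.

    Tied pairs count as neither concordant nor discordant, so heavy ties
    pull the value toward zero (tau-b or tau-c would correct for that).
    """
    if len(metric) != len(human):
        raise ValueError("score lists must have the same length")
    n = len(metric)
    if n < 2:
        raise ValueError("need at least two items")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (metric[i] - metric[j]) * (human[i] - human[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Toy data (made up): metric scores on a 0-100 scale, human ratings on 1-5.
metric_scores = [10, 80, 60, 90]
human_ratings = [1, 3, 4, 5]
tau = kendall_tau_a(metric_scores, human_ratings)
```

In the toy data, five of the six caption pairs are ordered the same way by the metric and by the humans and one pair is reversed, so `tau` is $(5 - 1) / 6 \approx 0.67$.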

Paper Structure (31 sections, 4 figures, 6 tables)

This paper contains 31 sections, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Overview of automatic caption quality evaluation by VisCE$^2$ and an example of the input/output. First, the VLM extracts the visual context from the image, organized in a bullet list format, presenting objects, object features, and relationships between objects. Then, the VLM evaluates the caption using the obtained visual context along with the image and candidate caption.
  • Figure 2: Heatmaps of human ratings and automatic evaluation scores on THumB (left), Flickr8k-Expert (mid), and Composite (right), normalized for each human evaluation score (i.e., per row). For THumB, the human evaluation refers to the total score.
  • Figure 3: Comparison of VisCE$^2$ scores, CLIP-S scores, and human ratings for candidate captions of images from the Composite dataset.
  • Figure 4: Comparison between evaluation scores of VisCE$^2$ and classic reference-based metrics. The image and the references are from the MS-COCO dataset. Both captions are automatically generated by GPT-4V.