Evaluating Image Caption via Cycle-consistent Text-to-Image Generation

Tianyu Cui, Jinbin Bai, Guo-Hua Wang, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Ye Shi

TL;DR

CAMScore is a cyclic, reference-free metric for image captioning: a text-to-image (T2I) model generates an image $I_{gen}$ from the caption, and $I_{gen}$ is then compared with the original image within the same modality. A three-level evaluation (pixel, semantic, object) is fused through an MLP to produce a caption quality score that aligns well with human judgments. Across Flickr8k-Expert, Flickr8k-CF, Composite, and Pascal-50S, CAMScore outperforms both reference-based and prior reference-free metrics in correlation with human judgments and in ranking accuracy, and this alignment holds across datasets. The framework also offers fine-grained diagnostic insights. Despite limitations tied to detector coverage and T2I generation speed, it provides a practical, scalable tool for automatic image-caption evaluation, with potential for integration into model training.

Abstract

Evaluating image captions typically relies on reference captions, which are costly to obtain and exhibit significant diversity and subjectivity. While reference-free evaluation metrics have been proposed, most focus on cross-modal evaluation between captions and images. Recent research has revealed that a modality gap generally exists in the representations of contrastive learning-based multi-modal systems, undermining the reliability of cross-modality metrics like CLIPScore. In this paper, we propose CAMScore, a cyclic reference-free automatic evaluation metric for image captioning models. To circumvent this modality gap, CAMScore uses a text-to-image model to generate images from captions and then evaluates these generated images against the original images. Furthermore, to provide fine-grained information for a more comprehensive evaluation, we design a three-level evaluation framework for CAMScore that encompasses pixel-level, semantic-level, and object-level perspectives. Extensive experimental results across multiple benchmark datasets show that CAMScore achieves a superior correlation with human judgments compared to existing reference-based and reference-free metrics, demonstrating the effectiveness of the framework.
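The cycle described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `text_to_image`, `object_score`, `toy_features`, and the MLP parameters are hypothetical placeholders, and mean absolute difference and cosine similarity merely stand in for whatever pixel-level and semantic-level measures the paper actually uses.

```python
import numpy as np

def pixel_score(original, generated):
    """Pixel-level similarity: 1 minus mean absolute difference (images in [0, 1])."""
    return 1.0 - float(np.mean(np.abs(original - generated)))

def semantic_score(feat_a, feat_b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(feat_a) * np.linalg.norm(feat_b)
    return float(feat_a @ feat_b / denom) if denom > 0 else 0.0

def toy_features(image):
    """Hypothetical stand-in for an image encoder: per-channel mean and std."""
    return np.concatenate([image.mean(axis=(0, 1)), image.std(axis=(0, 1))])

def fuse(scores, w1, b1, w2, b2):
    """Tiny MLP fusion: one ReLU hidden layer followed by a sigmoid output."""
    hidden = np.maximum(0.0, scores @ w1 + b1)
    return float(1.0 / (1.0 + np.exp(-(hidden @ w2 + b2))))

def cycle_score(original, caption, text_to_image, object_score, params):
    """Caption -> generated image -> three same-modality scores -> fused score."""
    generated = text_to_image(caption)  # assumed callable: str -> HxWx3 array
    scores = np.array([
        pixel_score(original, generated),
        semantic_score(toy_features(original), toy_features(generated)),
        object_score(original, generated),  # assumed callable returning a float
    ])
    return fuse(scores, *params)
```

Because every comparison is image-to-image, no text embedding is ever compared with an image embedding, which is how the cyclic design sidesteps the modality gap. In practice the fusion weights would be learned rather than set by hand.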

Paper Structure (22 sections, 11 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: (a) References fail to capture all information in the image, such as the color and position of the white dog. (b) The caption aligns with human judgment but scores low on reference-based metrics, because differing caption styles can misalign reference-based scores with human judgments. (c) The modality gap in cross-modality evaluation can cause confusion over attributes such as numeracy and spatial relationships.
  • Figure 2: Overview of our proposed framework.
  • Figure 3: Illustration of our proposed evaluation metrics. (a) Pixel-level evaluation computes pixel-by-pixel differences. (b) Semantic-level evaluation computes the similarity between the features of the original image and the generated image. (c) Object-level evaluation is detection-based and accounts for both object matching and spatial relationships.
  • Figure 4: Examples of successful (a, b, c) and failed (d) cases. Except for the last, non-photorealistic case, all are from the Flickr8K dataset. The first column shows the original image; the last three columns show the generated images with their captions and metric scores. Human judgment prefers the leftmost caption and dislikes the rightmost caption. Our metric is more consistent with human judgment.
  • Figure 5: Object detection and depth estimation results for the case study. The object detector successfully boxes the objects in cases (a, b, c) but fails in the non-photorealistic case (d).
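
The detection-based object-level evaluation mentioned in Figure 3(c) can be illustrated with a small sketch. The paper's actual formula does not appear in this section, so everything below is an assumption: the `Detection` type, the greedy same-label matching, and the use of box IoU as a rough proxy for spatial agreement are all hypothetical, and the depth estimation mentioned in Figure 5 is left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Detection:
    label: str
    box: tuple  # (x_min, y_min, x_max, y_max), normalized to [0, 1]

def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def object_score(original, generated):
    """Greedily match same-label detections and average their IoU.

    Dividing by the larger detection count penalizes both objects missing
    from the generated image and extra objects that it introduces.
    """
    if not original:
        return 1.0 if not generated else 0.0
    unused = list(generated)
    total = 0.0
    for det in original:
        candidates = [g for g in unused if g.label == det.label]
        if not candidates:
            continue  # object absent from the generated image
        best = max(candidates, key=lambda g: iou(det.box, g.box))
        total += iou(det.box, best.box)
        unused.remove(best)  # each generated detection is matched at most once
    return total / max(len(original), len(generated))
```

A score of this kind depends entirely on the detector's output, which is consistent with the failure in case (d): when the detector cannot box objects in a non-photorealistic image, there is nothing to match.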