Evaluating Image Caption via Cycle-consistent Text-to-Image Generation
Tianyu Cui, Jinbin Bai, Guo-Hua Wang, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Ye Shi
TL;DR
CAMScore is a cyclic, reference-free metric for image captioning: a text-to-image (T2I) model generates an image $I_{gen}$ from each caption, and $I_{gen}$ is compared with the original image, so the comparison stays within a single modality. The comparison runs at three levels (pixel, semantic, object), and an MLP fuses the three results into one caption quality score. On Flickr8k-Expert, Flickr8k-CF, Composite, and Pascal-50S, CAMScore outperforms both reference-based and prior reference-free metrics in correlation with human judgments and in ranking accuracy. The three-level design also yields fine-grained diagnostic information. Its limitations are tied to the coverage of the object detector and the speed of T2I generation; the authors nonetheless present it as a practical, scalable tool for automatic caption evaluation that could be integrated into model training.
Abstract
Evaluating image captions typically relies on reference captions, which are costly to obtain and exhibit significant diversity and subjectivity. While reference-free evaluation metrics have been proposed, most focus on cross-modal evaluation between captions and images. Recent research has revealed that a modality gap generally exists in the representations of contrastive-learning-based multimodal systems, undermining the reliability of cross-modal metrics such as CLIPScore. In this paper, we propose CAMScore, a cyclic reference-free automatic evaluation metric for image captioning models. To circumvent this modality gap, CAMScore uses a text-to-image model to generate images from captions and then evaluates the generated images against the original images. Furthermore, to provide fine-grained information for a more comprehensive evaluation, we design a three-level evaluation framework for CAMScore that encompasses pixel-level, semantic-level, and objective-level perspectives. Extensive experimental results across multiple benchmark datasets show that CAMScore achieves a higher correlation with human judgments than existing reference-based and reference-free metrics, demonstrating the effectiveness of the framework.
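The cyclic pipeline described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the per-level measures, so mean absolute pixel difference, cosine similarity of embeddings, and Jaccard overlap of detected labels are assumed stand-ins, and the callables `generate`, `embed`, and `detect` as well as the weights in `TOY_MLP` are hypothetical placeholders for a real T2I model, image encoder, object detector, and trained fusion network.

```python
import math
from typing import Callable, List, Sequence, Set, Tuple

Image = Sequence[float]  # toy image: flat list of pixel values in [0, 1]
MLP = Tuple[List[List[float]], List[float], List[float], float]


def pixel_similarity(a: Image, b: Image) -> float:
    """Assumed pixel-level measure: 1 minus mean absolute difference."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    return 1.0 - sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Assumed semantic-level measure: cosine similarity of embeddings."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


def object_overlap(a: Set, b: Set) -> float:
    """Assumed object-level measure: Jaccard overlap of detected labels."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def fuse_mlp(features: Sequence[float], mlp: MLP) -> float:
    """One-hidden-layer MLP (ReLU, then sigmoid) mapping features to (0, 1)."""
    w1, b1, w2, b2 = mlp
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(w1, b1)
    ]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))


def cyclic_score(
    image: Image,
    caption: str,
    generate: Callable[[str], Image],          # hypothetical T2I model
    embed: Callable[[Image], Sequence[float]],  # hypothetical image encoder
    detect: Callable[[Image], Set],             # hypothetical object detector
    mlp: MLP,
) -> float:
    """Generate an image from the caption, then compare image to image."""
    generated = generate(caption)
    features = [
        pixel_similarity(image, generated),
        cosine_similarity(embed(image), embed(generated)),
        object_overlap(detect(image), detect(generated)),
    ]
    return fuse_mlp(features, mlp)


# Hypothetical hand-picked weights; a real fusion MLP would be trained.
TOY_MLP: MLP = ([[1.0, 1.0, 1.0]], [0.0], [1.0], -1.5)
```

Because every comparison is between two images, no text-image embedding distance enters the score, which is the point of the cyclic design. The three intermediate features can also be inspected separately, in line with the fine-grained diagnostics the framework aims to provide.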
