BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues

Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

TL;DR

BRIDGE introduces a novel, reference-free image captioning evaluation metric that tightly integrates visual grounding into the scoring process. By constructing multimodal pseudo-captions through a mapping module that enriches template captions with fine-grained visual features, BRIDGE achieves stronger alignment with human judgments than prior reference-free metrics. The approach combines a dual-encoder backbone, weighted contrastive losses, and CLIP-based inference that fuses global and localized visual-textual evidence. It demonstrates state-of-the-art correlation with human judgment across several datasets and improved detection of caption hallucinations. The method is validated on COCO-derived data, is robust to template quality, and applies to both traditional and large-language-model-based captioners. Code and models are publicly available.

Abstract

Effectively aligning with human judgment when evaluating machine-generated image captions represents a complex yet intriguing challenge. Existing evaluation metrics like CIDEr or CLIP-Score fall short in this regard as they do not take into account the corresponding image or lack the capability of encoding fine-grained details and penalizing hallucinations. To overcome these issues, in this paper, we propose BRIDGE, a new learnable and reference-free image captioning metric that employs a novel module to map visual features into dense vectors and integrates them into multi-modal pseudo-captions which are built during the evaluation process. This approach results in a multimodal metric that properly incorporates information from the input image without relying on reference captions, bridging the gap between human judgment and machine-generated image captions. Experiments spanning several datasets demonstrate that our proposal achieves state-of-the-art results compared to existing reference-free evaluation scores. Our source code and trained models are publicly available at: https://github.com/aimagelab/bridge-score.
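
To make the pseudo-caption idea more concrete, here is a minimal sketch under stated assumptions. It shows only the general mechanism the abstract describes: visual features are mapped into dense vectors and placed into a template caption's token sequence. The function name build_pseudo_caption, the placeholder positions, and the linear projection (W, b) are hypothetical stand-ins; the actual BRIDGE mapping module is learned and is not specified in this summary.

```python
import numpy as np

def build_pseudo_caption(token_embs, mask_positions, visual_feats, W, b):
    """Illustrative only: fill placeholder slots of a template caption
    with visual features projected into the text embedding space.

    token_embs:     (seq_len, d_text) embeddings of the template caption
    mask_positions: indices of placeholder tokens in the template
    visual_feats:   (num_masks, d_vis) fine-grained image features
    W, b:           hypothetical linear map from d_vis to d_text
    """
    pseudo = np.array(token_embs, dtype=float, copy=True)   # leave the input untouched
    mapped = np.asarray(visual_feats, dtype=float) @ W + b  # (num_masks, d_text)
    for pos, vec in zip(mask_positions, mapped):
        pseudo[pos] = vec                                   # swap placeholder for pseudo-token
    return pseudo

# Toy example: a 4-token template with placeholders at positions 1 and 3.
template = np.zeros((4, 3))
visual = np.ones((2, 2))
W = np.ones((2, 3))
b = np.zeros(3)
pseudo_caption = build_pseudo_caption(template, [1, 3], visual, W, b)
```

In this toy setup the resulting sequence keeps the template's textual tokens and carries image-derived vectors only in the placeholder slots, which is the sense in which a pseudo-caption is multimodal.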

Paper Structure

This paper contains 16 sections, 7 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Comparison between different captioning evaluation approaches: (a) CIDEr (Vedantam et al., 2015) scores candidate captions against reference captions without considering the input image; (b) CLIP-Score (Hessel et al., 2021) compares text and images using global vectors in a shared embedding space; (c) our BRIDGE internally builds multimodal pseudo-captions by translating fine-grained image features into pseudo-tokens through a mapping module.
  • Figure 2: Overview of the BRIDGE evaluation approach. Starting from a template caption, a mapping network augments it with dense visual features, obtaining a pseudo-caption which is then used for computing image-text similarities.
  • Figure 3: COCO captions with template captions and associated noun chunks.
  • Figure 4: Sample images from the FOIL dataset (Shekhar et al., 2017) and corresponding scores generated by our proposed metric compared with CLIP-S and PAC-S.
  • Figure 5: Metric scores for top-$k$ detections ranked by probability (left) and as a function of detection area (right).
  • ...and 3 more figures
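
For context on the baseline in Figure 1(b), the following is a minimal sketch of a CLIP-Score-style computation: a rescaled, clipped cosine similarity between one global image vector and one global caption vector. It assumes the embeddings have already been produced by a CLIP-style dual encoder; the toy vectors below are placeholders, and the default weight w = 2.5 follows the CLIP-Score paper. This is not the BRIDGE scoring function, which additionally uses pseudo-captions.

```python
import numpy as np

def clip_score(image_emb, text_emb, w=2.5):
    """CLIP-Score-style similarity: w * max(cos(image, text), 0)."""
    image_emb = np.asarray(image_emb, dtype=float)
    text_emb = np.asarray(text_emb, dtype=float)
    cos = float(image_emb @ text_emb
                / (np.linalg.norm(image_emb) * np.linalg.norm(text_emb)))
    return w * max(cos, 0.0)  # negative similarities are clipped to zero

# Toy embeddings (placeholders for real encoder outputs).
aligned = clip_score([1.0, 0.0], [1.0, 0.0])
orthogonal = clip_score([1.0, 0.0], [0.0, 1.0])
opposite = clip_score([1.0, 0.0], [-1.0, 0.0])
```

Because the score depends on a single global vector per modality, it has no direct handle on individual objects in the image, which is the limitation the fine-grained pseudo-tokens in Figure 1(c) are meant to address.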