HICEScore: A Hierarchical Metric for Image Captioning Evaluation

Zequn Zeng; Jianqiao Sun; Hao Zhang; Tiansheng Wen; Yudi Su; Yan Xie; Zhengjue Wang; Bo Chen

TL;DR

HICE-S is a hierarchical, reference-free metric for image captioning evaluation that combines global image-text compatibility with local region-phrase completeness. It uses Alpha-CLIP to compute global ITC and TTC, and constructs local representations from semantic regions and textual phrases to obtain $l\mathrm{ITC}$ and $l\mathrm{TTC}$, which are fused via a harmonic mean to form $\mathrm{HICE}(I, C)$. When references are available, the method extends to RefHICE-S by including the global and local TTC components. Both variants correlate strongly with human judgments, rank captions robustly, and detect object hallucinations effectively. Experiments across multiple benchmarks show SOTA performance for both HICE-S and RefHICE-S, and ablations confirm that the hierarchical design and localized analysis improve both interpretability and accuracy.

Abstract

Image captioning evaluation metrics can be divided into two categories: reference-based metrics and reference-free metrics. However, reference-based approaches may struggle to evaluate descriptive captions with abundant visual details produced by advanced multimodal large language models, due to their heavy reliance on limited human-annotated references. In contrast, previous reference-free metrics have been proven effective via CLIP cross-modality similarity. Nonetheless, CLIP-based metrics, constrained by their reliance on global image-text compatibility, are often deficient in detecting local textual hallucinations and are insensitive to small visual objects. Besides, their single-scale designs are unable to provide an interpretable evaluation process, such as pinpointing the position of caption mistakes and identifying visual regions that have not been described. To move forward, we propose a novel reference-free metric for image captioning evaluation, dubbed Hierarchical Image Captioning Evaluation Score (HICE-S). By detecting local visual regions and textual phrases, HICE-S builds an interpretable hierarchical scoring mechanism, breaking through the barriers of the single-scale structure of existing reference-free metrics. Comprehensive experiments indicate that our proposed metric achieves SOTA performance on several benchmarks, outperforming existing reference-free metrics like CLIP-S and PAC-S, and reference-based metrics like METEOR and CIDEr. Moreover, several case studies reveal that the assessment process of HICE-S on detailed captions closely resembles interpretable human judgments. Our code is available at https://github.com/joeyz0z/HICE.

Paper Structure (22 sections, 7 equations, 7 figures, 6 tables)

Introduction
Related Work
Reference-based metric
Reference-free metric
Hierarchical Image Captioning Evaluation
Problem formulation and preliminaries
Hierarchical evaluation
Global evaluation
Local evaluation
Reference-free metric: HICE-S
Reference-based metric: RefHICE-S
Experiments
Implementation details
Correlation with human judgments
Caption pairwise ranking
...and 7 more sections

Figures (7)

Figure 1: Comparison of the evaluation scores of various metrics when assessing (a) brief captions or (b) detailed captions, where correct descriptions of small objects are highlighted in blue and incorrect hallucinations in gold. Evaluation scores that agree with human judgments are highlighted in green, and those that disagree in red.
Figure 2: An illustration of our proposed HICEScore. Left: global image-caption compatibility $\mathrm{gITC}$ and reference-caption compatibility $\mathrm{gTTC}$. Right: local image-caption compatibility $l\mathrm{ITC}$ and reference-caption compatibility $l\mathrm{TTC}$.
Figure 3: An illustration of local evaluation including $l\mathrm{ITC}$ (left part, Eq. \ref{eq: lITC}) and $l\mathrm{TTC}$ (right part, Eq. \ref{eq: lTTC}), where both the precision $\mathrm{P}$ and recall $\mathrm{R}$ are computed before obtaining the final fusion score through the harmonic mean $\mathrm{hMean}(\cdot, \cdot)$.
Figure 3: Different scores of previous SOTA captioning models on the COCO test set \cite{lin2014microsoft}.
Figure 4: Precision and recall evaluation scores of HICE-S and InfoMetIC compared to human correctness and completeness scores. InfoMetIC scores are normalized to the range 0 to 1 for easier comparison.
...and 2 more figures
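
The local evaluation caption above describes computing a precision $\mathrm{P}$ and a recall $\mathrm{R}$ and fusing them with $\mathrm{hMean}(\cdot, \cdot)$. The following is a minimal sketch of that idea, not the authors' implementation. It assumes that region and phrase embeddings are already available as NumPy arrays, that similarity is cosine similarity clipped at zero, and that each phrase (or region) is scored by its best match on the other side. The function names `h_mean` and `local_score` are hypothetical, and the matching rule is an assumption that the paper's equations may define differently.

```python
import numpy as np


def h_mean(p, r, eps=1e-8):
    """Harmonic mean of two non-negative scores (eps avoids division by zero)."""
    return 2.0 * p * r / (p + r + eps)


def local_score(region_emb, phrase_emb):
    """Illustrative local region-phrase score (assumed greedy best-match rule).

    region_emb: array of shape (n_regions, d), one embedding per image region.
    phrase_emb: array of shape (n_phrases, d), one embedding per caption phrase.
    Returns (precision, recall, fused).
    """
    # L2-normalize so that the dot product equals cosine similarity.
    regions = region_emb / np.linalg.norm(region_emb, axis=1, keepdims=True)
    phrases = phrase_emb / np.linalg.norm(phrase_emb, axis=1, keepdims=True)

    # Similarity matrix of shape (n_regions, n_phrases), clipped at zero (assumption).
    sim = np.clip(regions @ phrases.T, 0.0, None)

    # Precision: how well each phrase is supported by some region
    # (a low value hints at a hallucinated phrase).
    precision = float(sim.max(axis=0).mean())

    # Recall: how well each region is covered by some phrase
    # (a low value hints at an undescribed region).
    recall = float(sim.max(axis=1).mean())

    return precision, recall, h_mean(precision, recall)


if __name__ == "__main__":
    # Toy example: three orthogonal regions, one phrase matching the first region.
    regions = np.eye(3)
    phrases = np.eye(3)[:1]
    print(local_score(regions, phrases))
```

In the toy example the single phrase is fully supported by one region, so precision is high, while two of the three regions are not described, so recall is low. This mirrors the correctness and completeness split mentioned in the Figure 4 caption.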
