FLEUR: An Explainable Reference-Free Evaluation Metric for Image Captioning Using a Large Multimodal Model

Yebin Lee; Imseong Park; Myungjoo Kang

TL;DR

FLEUR addresses the need for an explainable, reference-free image captioning evaluator by leveraging a large multimodal model to score captions directly against images and generate human-readable explanations. It introduces score smoothing and grading-criteria prompts to align scores with human judgment and improve granularity, achieving state-of-the-art correlations among reference-free metrics and competitive results against reference-based baselines. The framework supports a RefFLEUR variant that can incorporate references, and it demonstrates robustness against object hallucination while providing interpretable explanations that can inform model development. The work also analyzes model-size effects, prompt design, and inference-time considerations, and releases open-source code for replication and extension.

Abstract

Most existing image captioning evaluation metrics focus on assigning a single numerical score to a caption by comparing it with reference captions. However, these methods do not provide an explanation for the assigned score. Moreover, reference captions are expensive to acquire. In this paper, we propose FLEUR, an explainable reference-free metric to introduce explainability into image captioning evaluation metrics. By leveraging a large multimodal model, FLEUR can evaluate the caption against the image without the need for reference captions, and provide the explanation for the assigned score. We introduce score smoothing to align as closely as possible with human judgment and to be robust to user-defined grading criteria. FLEUR achieves high correlations with human judgment across various image captioning evaluation benchmarks and reaches state-of-the-art results on Flickr8k-CF, COMPOSITE, and Pascal-50S within the domain of reference-free evaluation metrics. Our source code and results are publicly available at: https://github.com/Yebin46/FLEUR.

Paper Structure (41 sections, 5 equations, 7 figures, 8 tables)

Introduction
Related Works
Image Captioning Evaluation Metrics
LLM-based Evaluation Metrics
Metrics for NLG evaluation
Metrics for caption evaluation
Method
Prompt for Caption Evaluation
Base instruction
Grading criteria
Score Smoothing
RefFLEUR
Prompt for Explanation
Experiments
Correlations with Human Judgment
...and 26 more sections

Figures (7)

Figure 1: Top: Comparison between other, non-explainable metrics and our explainable metric, FLEUR. FLEUR also provides an explanation for the assigned score. Bottom: The existing explainable metric cannot consider the image. The information highlighted in red in the candidate caption is not present in the reference caption set, which confuses that metric.
Figure 2: The overall framework of FLEUR. Left: When feeding LLaVA with the prompt containing the grading criteria, image, and the candidate caption for evaluation, FLEUR takes a weighted sum of probabilities of tokens (0 to 9) as the final score. Right: When prompted by the user for the rationale behind the given score, FLEUR provides explanations in a language understandable to humans.
Figure 3: Comparison between the explanations of FLEUR and CLAIR. The parts highlighted in red indicate inaccuracies in the explanation. Note that in these examples, FLEUR does not use a reference caption set as input. For space reasons, parts of the explanations are elided with '...'. The omitted parts can be found in the Appendix.
Figure 4: Examples of a FLEUR score and a raw score for the same image-candidate caption pair, along with explanations for each score. The parts highlighted in red indicate incorrect captions and incorrect explanations, while the parts marked in green indicate correct explanations. For space reasons, parts of the explanations are elided with '...'. The omitted parts can be found in the Appendix.
Figure 5: (a) Ablation study based on the scores included in the grading criteria. (b) Effect of directly obtaining probabilities.
...and 2 more figures
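
The Figure 2 caption describes the final score as a weighted sum of the probabilities of the digit tokens 0 to 9. A minimal Python sketch of that idea for a single digit position follows. It is an illustration under stated assumptions, not the authors' implementation: the function name `smoothed_score`, the ten-logit input, and the restriction of the softmax to the ten digit tokens are all hypothetical choices, and the paper's exact formula and score scale are not reproduced here.

```python
import math


def smoothed_score(digit_logits):
    """Probability-weighted digit value for one digit position.

    digit_logits: ten numbers, where entry d is the model's logit for the
    digit token d (0-9). Assumption: the softmax is taken over these ten
    tokens only, which may differ from how the paper normalizes them.
    """
    if len(digit_logits) != 10:
        raise ValueError("expected one logit per digit token 0-9")
    peak = max(digit_logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in digit_logits]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Weighted sum: each digit value times its probability.
    return sum(digit * p for digit, p in enumerate(probs))


def raw_score(digit_logits):
    """Hard choice for comparison: the single most likely digit."""
    return max(range(10), key=lambda d: digit_logits[d])
```

When the model splits its probability between two neighboring digits, `raw_score` returns only one of them, while `smoothed_score` returns a value in between, which is how a weighted sum yields finer-grained scores than the decoded digit alone.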