CLAIR: Evaluating Image Captions with Large Language Models

David Chan, Suzanne Petryk, Joseph E. Gonzalez, Trevor Darrell, John Canny

TL;DR

This paper presents CLAIR, a zero-shot, LLM-based method for evaluating image captions: the model is asked whether a candidate caption conveys the same content as the reference captions, and returns a numeric score together with an explanation. An ensemble variant, CLAIR_E, combines multiple LLMs to improve alignment with human judgments, and CLAIR outperforms traditional metrics on several benchmarks. Results on MS-COCO, Flickr8K-Expert, PASCAL-50S, and COCO-Sets show strong correlation with human preferences and a competitive ability to discriminate between captions, with interpretability as a key advantage. The work points to language-only models as a way to assess multimodal content and suggests applying the approach to related tasks such as visual storytelling.
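The scoring loop summarized above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the prompt wording is a paraphrase, the JSON response format is an assumption, the `llm` callable is a hypothetical stand-in for whatever model API is used, and the ensemble is assumed to be a plain mean over models.

```python
import json
import re
from statistics import mean
from typing import Callable, Sequence, Tuple

# Paraphrased prompt (an assumption); the authors' exact wording may differ.
PROMPT_TEMPLATE = (
    "You are trying to tell if a candidate set of captions is describing "
    "the same image as a reference set of captions.\n"
    "Candidate set:\n{candidates}\n"
    "Reference set:\n{references}\n"
    "On a precise scale from 0 to 100, how likely is it that the candidate "
    "set is describing the same image as the reference set? "
    'Answer in JSON with a key "score" (0 to 100) and a key "reason" (string).'
)

# Hypothetical interface: any function mapping a prompt string to a reply string.
LLM = Callable[[str], str]


def build_prompt(candidates: Sequence[str], references: Sequence[str]) -> str:
    """Fill the template with bulleted candidate and reference captions."""
    def bullets(captions: Sequence[str]) -> str:
        return "\n".join(f"- {c}" for c in captions)
    return PROMPT_TEMPLATE.format(
        candidates=bullets(candidates), references=bullets(references)
    )


def parse_response(text: str) -> Tuple[float, str]:
    """Extract the JSON object from the reply; return (score in [0, 1], reason)."""
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in LLM response")
    data = json.loads(match.group(0))
    score = min(max(float(data["score"]), 0.0), 100.0) / 100.0  # normalize by 100
    return score, str(data.get("reason", ""))


def clair_score(llm: LLM, candidates: Sequence[str],
                references: Sequence[str]) -> Tuple[float, str]:
    """Score one candidate set against its references with a single LLM."""
    return parse_response(llm(build_prompt(candidates, references)))


def ensemble_score(llms: Sequence[LLM], candidates: Sequence[str],
                   references: Sequence[str]) -> float:
    """Assumed ensemble rule: average the normalized scores across models."""
    return mean(clair_score(llm, candidates, references)[0] for llm in llms)


# Stub model for illustration only; a real setup would call an actual LLM here.
def fake_llm(prompt: str) -> str:
    return '{"score": 85, "reason": "Both describe a dog catching a frisbee."}'


score, reason = clair_score(
    fake_llm, ["A dog catches a frisbee."], ["A dog jumping for a frisbee."]
)
```

Keeping the model behind a plain callable means the same code works for a single model or for an ensemble, and the explanation string is returned next to the score rather than discarded.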

Abstract

The evaluation of machine-generated image captions poses an interesting yet persistent challenge. Effective evaluation measures must consider numerous dimensions of similarity, including semantic relevance, visual structure, object interactions, caption diversity, and specificity. Existing highly-engineered measures attempt to capture specific aspects, but fall short in providing a holistic score that aligns closely with human judgments. Here, we propose CLAIR, a novel method that leverages the zero-shot language modeling capabilities of large language models (LLMs) to evaluate candidate captions. In our evaluations, CLAIR demonstrates a stronger correlation with human judgments of caption quality compared to existing measures. Notably, on Flickr8K-Expert, CLAIR achieves relative correlation improvements over SPICE of 39.6% and over image-augmented methods such as RefCLIP-S of 18.3%. Moreover, CLAIR provides noisily interpretable results by allowing the language model to identify the underlying reasoning behind its assigned score. Code is available at https://davidmchan.github.io/clair/
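The percentages quoted in the abstract are relative, not absolute, gains. Assuming they are computed as the relative change in a correlation coefficient with human judgments (the abstract does not spell out the formula), the arithmetic looks like the sketch below. The input values are hypothetical and chosen only to make the calculation easy to follow; they are not results from the paper.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative change of `new` over `baseline`, e.g. 0.2 for a 20% gain."""
    if baseline == 0:
        raise ValueError("baseline correlation must be non-zero")
    return (new - baseline) / baseline


# Hypothetical correlations, for illustration only.
gain = relative_improvement(new=0.60, baseline=0.50)  # about 0.20, i.e. a 20% relative gain
```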

Paper Structure

This paper contains 16 sections, 2 figures, and 4 tables.

Figures (2)

  • Figure 1: CLAIR: a (surprisingly simple) large language model-based measure for image caption evaluation. We find that CLAIR not only correlates strongly with human judgments of caption quality but can also generate interpretable reasons for the generated scores.
  • Figure 2: Several qualitative examples of CLAIR from the Flickr8K-Expert dataset. CLAIR not only correlates better with human judgments of caption quality but also provides detailed explanations for its score. CLAIR scores normalized by 100.