
Polos: Multimodal Metric Learning from Human Feedback for Image Captioning

Yuiga Wada, Kanta Kaneda, Daichi Saito, Komei Sugiura

TL;DR

Polos addresses the mismatch between automatic image captioning metrics and human judgments by introducing a supervised, multimodal metric learned from human feedback. It combines CLIP-based vision-language features with SimCSE-finetuned RoBERTa sentence embeddings in a parallel feature extraction framework and is trained under the Multimodal Metric Learning from Human Feedback ($\mathrm{M^2LHF}$) paradigm. Using the Polaris dataset of 131K judgments from 550 evaluators, Polos achieves state-of-the-art correlations with human judgments across diverse benchmarks and shows strong zero-shot robustness, including on FOIL and PASCAL-50S. The result is a practical metric that tracks human judgments of caption quality more closely and handles hallucinations better than prior metrics.

Abstract

Establishing an automatic evaluation metric that closely aligns with human judgments is essential for effectively developing image captioning models. Recent data-driven metrics have demonstrated a stronger correlation with human judgments than classic metrics such as CIDEr; however, they lack sufficient capabilities to handle hallucinations and to generalize across diverse images and texts, partly because they compute scalar similarities merely using embeddings learned from tasks unrelated to image captioning evaluation. In this study, we propose Polos, a supervised automatic evaluation metric for image captioning models. Polos computes scores from multimodal inputs, using a parallel feature extraction mechanism that leverages embeddings trained through large-scale contrastive learning. To train Polos, we introduce Multimodal Metric Learning from Human Feedback ($\mathrm{M^2LHF}$), a framework for developing metrics based on human feedback. We constructed the Polaris dataset, which comprises 131K human judgments from 550 evaluators and is approximately ten times larger than standard datasets. Our approach achieved state-of-the-art performance on Composite, Flickr8K-Expert, Flickr8K-CF, PASCAL-50S, FOIL, and the Polaris dataset, thereby demonstrating its effectiveness and robustness.
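The abstract measures a metric by how well its scores correlate with human judgments, but it does not name the correlation coefficient. As a minimal sketch, assuming a rank correlation such as Kendall's tau (a common choice for this kind of comparison), the check could look like the following. The function name `kendall_tau_a` and all scores are made up for illustration and are not taken from the paper.

```python
from itertools import combinations


def kendall_tau_a(metric_scores, human_scores):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(metric_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    n = len(metric_scores)
    if n < 2:
        raise ValueError("need at least two samples")
    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        # The sign of this product tells whether both rankings agree on the pair.
        product = (metric_scores[i] - metric_scores[j]) * (
            human_scores[i] - human_scores[j]
        )
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
        # Tied pairs (product == 0) count as neither in tau-a.
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical human judgments (normalized to [0, 1]) and two made-up metrics.
human = [0.0, 0.25, 0.5, 0.75, 1.0]
metric_a = [0.1, 0.3, 0.2, 0.8, 0.9]  # mostly agrees with the human ranking
metric_b = [0.9, 0.1, 0.5, 0.2, 0.4]  # mostly disagrees

tau_a = kendall_tau_a(metric_a, human)
tau_b = kendall_tau_a(metric_b, human)
print(f"metric_a tau = {tau_a:.2f}, metric_b tau = {tau_b:.2f}")
```

A higher tau means the metric orders captions more like the human evaluators do. In practice a library routine that handles ties (tau-b or tau-c) would usually replace this hand-written version.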

Paper Structure

This paper contains 38 sections, 5 equations, 16 figures, and 7 tables.

Figures (16)

  • Figure 1: Our supervised metric Polos computes evaluation scores from multimodal inputs by integrating human feedback within the novel framework $\mathrm{M^2LHF}$. Polos is capable of modeling intricate relationships within the vector space of text-image pairs as well as text-text pairs, thereby effectively evaluating the depicted samples.
  • Figure 2: Overview of the proposed metric. In alignment with the principles of $\mathrm{M^2LHF}$, Polos computes the evaluation $\hat{y}$ based on multimodal inputs and regresses the human evaluation. The proposed metric extracts effective features for caption evaluation using the difference and Hadamard product of features derived from both CLIP and RoBERTa.
  • Figure 3: Examples of successful and failed cases from the Polaris dataset. Values in blue indicate critical errors, and values in red represent those closest to the human judgments. Underlined words indicate significant inaccuracies in the candidate captions. These results demonstrate that our proposed metric effectively handled multimodal inputs and yielded evaluation scores that aligned closely with human judgment. Note that $\bm{x}_\mathrm{ref}^{(1)}$ and $\bm{x}_\mathrm{cand}$ denote one of the reference captions and the candidate caption, respectively.
  • Figure 4: Score distributions of human judgments in Composite, Flickr8K-Expert, Flickr8K-CF, and our Polaris dataset. All scores were normalized from 0 to 1. Polaris distinguishes itself from other datasets by encompassing a vast collection of captions and integrating a broad spectrum of human judgments.
  • Figure 5: Additional examples from the Polaris dataset. Blue blocks indicate critical errors, and underlined words mark noteworthy features. The CLIPScore family tends to overestimate scores. Specifically, reference-with-image metrics such as RefCLIP-S and RefPAC-S may not effectively compare the references with a candidate. CLIP-S does not exhibit a tendency to overestimate; however, this does not necessarily imply that it scores captions adequately. Rather, it may indicate a deficiency in its estimation capabilities, particularly for longer captions. This limitation likely stems from poor alignment between words and images in extended captions, as CLIP relies heavily on the alignment between image and language features.
  • ...and 11 more figures
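
The Figure 2 caption describes features built from the difference and Hadamard product of CLIP and RoBERTa embeddings, followed by a regression onto the human evaluation. The following is a minimal sketch of that idea, not the authors' implementation. Several details are assumptions made for illustration: taking the absolute value of the difference, which embedding pairs are compared, concatenating the results, the linear head with a sigmoid, and the squared-error loss. All function names and the toy vectors are hypothetical.

```python
import math


def hadamard(u, v):
    """Element-wise (Hadamard) product of two equal-length vectors."""
    return [a * b for a, b in zip(u, v)]


def abs_difference(u, v):
    """Element-wise absolute difference (the absolute value is an assumption)."""
    return [abs(a - b) for a, b in zip(u, v)]


def pair_features(u, v):
    """Concatenate the difference and Hadamard product for one embedding pair."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return abs_difference(u, v) + hadamard(u, v)


def build_features(clip_cand, clip_ref, clip_img, text_cand, text_ref):
    """Assumed pairing: candidate vs. reference and candidate vs. image in the
    CLIP space, plus candidate vs. reference in the sentence-embedding space."""
    return (
        pair_features(clip_cand, clip_ref)
        + pair_features(clip_cand, clip_img)
        + pair_features(text_cand, text_ref)
    )


def predict_score(features, weights, bias):
    """Hypothetical regression head: linear layer squashed into (0, 1)."""
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def squared_error(y_hat, y_human):
    """Assumed training loss against a normalized human judgment."""
    return (y_hat - y_human) ** 2


# Toy embeddings standing in for real CLIP / RoBERTa outputs.
clip_cand = [1.0, 0.0, 2.0]
clip_ref = [1.0, 1.0, 0.0]
clip_img = [0.5, 0.0, 2.0]
text_cand = [2.0, 1.0]
text_ref = [1.0, 1.0]

features = build_features(clip_cand, clip_ref, clip_img, text_cand, text_ref)
weights = [0.0] * len(features)  # untrained head
y_hat = predict_score(features, weights, bias=0.0)
loss = squared_error(y_hat, y_human=0.75)
print(len(features), y_hat, loss)
```

Keeping the difference and product as separate features, rather than collapsing each pair into a single cosine similarity, lets the regression head weight individual embedding dimensions; this is one reading of what the caption means by modeling relationships in the vector space rather than computing a scalar similarity.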