Polos: Multimodal Metric Learning from Human Feedback for Image Captioning
Yuiga Wada, Kanta Kaneda, Daichi Saito, Komei Sugiura
TL;DR
Polos tackles the mismatch between automatic image captioning evaluation and human judgments by introducing a supervised, multimodal metric learned from human feedback. It combines CLIP-based vision-language features with SimCSE-finetuned RoBERTa sentence embeddings in a parallel feature extraction framework, trained under the Multimodal Metric Learning from Human Feedback (M$^2$LHF) paradigm. Trained on the Polaris dataset of 131K human judgments from 550 evaluators, Polos achieves state-of-the-art correlations with human judgments across diverse benchmarks and demonstrates strong zero-shot robustness, including on FOIL and PASCAL-50S. Overall, the work provides a practical, robust metric that better reflects human judgments of caption quality and handles hallucinations better, with potential for broad adoption in image captioning evaluation and model development.
Abstract
Establishing an automatic evaluation metric that closely aligns with human judgments is essential for effectively developing image captioning models. Recent data-driven metrics have demonstrated a stronger correlation with human judgments than classic metrics such as CIDEr; however, they lack sufficient capabilities to handle hallucinations and to generalize across diverse images and texts, partly because they compute scalar similarities merely using embeddings learned from tasks unrelated to image captioning evaluation. In this study, we propose Polos, a supervised automatic evaluation metric for image captioning models. Polos computes scores from multimodal inputs, using a parallel feature extraction mechanism that leverages embeddings trained through large-scale contrastive learning. To train Polos, we introduce Multimodal Metric Learning from Human Feedback (M$^2$LHF), a framework for developing metrics based on human feedback. We constructed the Polaris dataset, which comprises 131K human judgments from 550 evaluators and is approximately ten times larger than standard datasets. Our approach achieved state-of-the-art performance on Composite, Flickr8K-Expert, Flickr8K-CF, PASCAL-50S, FOIL, and the Polaris dataset, thereby demonstrating its effectiveness and robustness.
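
The alignment with human judgments discussed above is typically quantified as a rank correlation between a metric's scores and human ratings of the same captions. The following is a minimal sketch of one such measure, Kendall's tau-b, written in plain Python. The choice of tau-b, the function name `kendall_tau_b`, and the toy scores are assumptions for illustration only; they are not taken from the paper and do not reproduce its evaluation protocol or results.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(metric_scores, human_scores):
    """Kendall's tau-b between metric scores and human ratings.

    Illustrative helper (hypothetical name, not from the paper).
    Returns a value in [-1, 1]; 1 means the metric ranks captions
    exactly as the human raters do.
    """
    if len(metric_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")

    concordant = discordant = 0
    tied_metric_only = tied_human_only = 0

    # Compare every pair of captions once.
    for (m1, h1), (m2, h2) in combinations(zip(metric_scores, human_scores), 2):
        dm, dh = m1 - m2, h1 - h2
        if dm == 0 and dh == 0:
            continue  # tied in both rankings: contributes to neither term
        if dm == 0:
            tied_metric_only += 1
        elif dh == 0:
            tied_human_only += 1
        elif dm * dh > 0:
            concordant += 1
        else:
            discordant += 1

    untied = concordant + discordant
    denom = sqrt((untied + tied_metric_only) * (untied + tied_human_only))
    return (concordant - discordant) / denom if denom else 0.0


# Toy data (made up): four captions scored by a metric and by humans.
toy_metric = [0.10, 0.40, 0.35, 0.80]
toy_human = [1, 3, 2, 4]
tau = kendall_tau_b(toy_metric, toy_human)
```

In this toy case the metric orders the four captions exactly as the human ratings do, so every pair is concordant. On real benchmarks, human ratings contain many ties, which is why a tie-aware variant is used in this sketch.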
