Cobra Effect in Reference-Free Image Captioning Metrics

Zheng Ma, Changxin Wang, Yawen Ouyang, Fei Zhao, Jianbing Zhang, Shujian Huang, Jiajun Chen

TL;DR

This paper examines the deficiencies of reference-free image captioning metrics built on visual-language models, using a Cobra Effect-inspired setup in which each metric serves as the reward during caption generation. Optimizing for these metrics raises their scores but yields incoherent, repetitive captions, which exposes latent flaws in the metrics. To address this, the authors introduce Self-Improving, which treats the flawed captions produced under a metric as negative samples, retrains the metric via contrastive learning, and reapplies the corrected metric as a reward to improve robustness. They also propose Flaws Caption, a challenging benchmark that stress-tests metrics under interference, and report, based on GPT-4V evaluation, state-of-the-art robustness and coherence for captions produced under the repaired metric. The work offers a practical way to strengthen metric reliability and a new benchmark for evaluating metric resilience in captioning tasks.

Abstract

Evaluating the compatibility between textual descriptions and corresponding images represents a core endeavor within multi-modal research. In recent years, a proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged. Empirical evidence has substantiated that these innovative approaches exhibit a higher correlation with human judgment, marking a significant advancement in the field. However, does a higher correlation with human evaluations alone sufficiently denote the completeness of a metric? In response to this question, in this paper, we study whether there are any deficiencies in reference-free metrics. Specifically, inspired by the Cobra Effect, we utilize metric scores as rewards to direct the captioning model toward generating descriptions that closely align with the metric's criteria. If a certain metric has flaws, they will be exploited by the model and reflected in the generated sentences. Our findings reveal that descriptions guided by these metrics contain significant flaws, e.g., incoherent statements and excessive repetition. Subsequently, we propose a novel method termed Self-Improving to rectify the identified shortcomings within these metrics. We employ GPT-4V as an evaluative tool to assess generated sentences, and the results reveal that our approach achieves state-of-the-art (SOTA) performance. In addition, we introduce a challenging evaluation benchmark called Flaws Caption to evaluate reference-free image captioning metrics comprehensively. Our code is available at https://github.com/aaronma2020/robust_captioning_metric
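
The TL;DR describes Self-Improving as retraining a metric with contrastive learning, using flawed captions as negative samples. The exact loss is not given in this section, so the following is only a minimal sketch under stated assumptions: an InfoNCE-style image-to-text loss over plain Python lists, where each image sees its own flawed caption as one extra negative. The function name, the temperature value, and the toy embeddings are all hypothetical and are not taken from the authors' code.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def contrastive_loss_with_flawed_negatives(image_embs, caption_embs, flawed_embs,
                                           temperature=0.07):
    """Assumed loss form: image-to-text InfoNCE plus one flawed caption per image.

    image_embs[i] and caption_embs[i] form the positive pair; every other
    caption in the batch and flawed_embs[i] act as negatives for image i.
    """
    images = [normalize(v) for v in image_embs]
    captions = [normalize(v) for v in caption_embs]
    flawed = [normalize(v) for v in flawed_embs]

    total = 0.0
    for i, img in enumerate(images):
        logits = [dot(img, cap) / temperature for cap in captions]
        logits.append(dot(img, flawed[i]) / temperature)  # extra hard negative
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]  # negative log-softmax of the positive
    return total / len(images)


# Toy 2-D embeddings (hypothetical): two images with their matching captions.
images = [[1.0, 0.0], [0.0, 1.0]]
captions = [[1.0, 0.0], [0.0, 1.0]]

# Flawed captions the metric already separates well vs. ones it still confuses.
loss_far = contrastive_loss_with_flawed_negatives(images, captions, [[0.0, 1.0], [1.0, 0.0]])
loss_close = contrastive_loss_with_flawed_negatives(images, captions, [[1.0, 0.1], [0.1, 1.0]])
```

In this sketch the loss is larger when a flawed caption sits close to its image in embedding space, so minimizing it pushes such captions away, which is the intended effect of adding them as negatives.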

Paper Structure

This paper contains 20 sections, 5 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Top: ideal state of the sentence generation process. Bottom: real situation of the sentence generation process.
  • Figure 2: Flowchart for evaluating reference-free metrics. PT Stage: the model is optimized using cross-entropy loss. RL Stage: we employ two distinct decoding strategies: greedy decoding, resulting in $T_g$, and sampling decoding, producing $T_s$. The objective is to evaluate the difference between these two generated sentences in terms of the specified metric.
  • Figure 3: PyTorch-like pseudocode for the core of an implementation of Self-Improving based on CLIP.
  • Figure 4: The prompt of the GPT-4V evaluator.
  • Figure 5: The results of GPT-4V evaluation.
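
The RL stage described in the Figure 2 caption compares a greedy caption $T_g$ with a sampled caption $T_s$ under the metric. That description resembles self-critical policy-gradient training, so the sketch below assumes that form: the reward of the greedy caption acts as a baseline for the sampled one. The toy metric, the captions, and the token probabilities are hypothetical and exist only to show how a flawed metric can reward repetition.

```python
import math


def toy_metric(caption, image_keywords):
    """Hypothetical flawed metric: counts keyword hits and ignores fluency."""
    return sum(1.0 for token in caption.split() if token in image_keywords)


def self_critical_loss(sampled_log_probs, sampled_reward, greedy_reward):
    """Assumed self-critical form: the greedy reward is the baseline.

    A positive advantage means minimizing this loss raises the probability
    of the sampled caption.
    """
    advantage = sampled_reward - greedy_reward
    return -advantage * sum(sampled_log_probs)


image_keywords = {"dog", "frisbee"}
greedy_caption = "a dog catches a frisbee"        # plays the role of T_g
sampled_caption = "dog dog frisbee dog frisbee"   # plays the role of T_s

reward_greedy = toy_metric(greedy_caption, image_keywords)
reward_sampled = toy_metric(sampled_caption, image_keywords)

# Pretend the captioning model assigned probability 0.2 to each sampled token.
sampled_log_probs = [math.log(0.2)] * 5
loss = self_critical_loss(sampled_log_probs, reward_sampled, reward_greedy)
```

Because the toy metric scores the repetitive sample above the fluent greedy caption, the advantage is positive and the update would make repetition more likely. This is the kind of exploitation the Cobra Effect setup is meant to surface.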