DENEB: A Hallucination-Robust Automatic Evaluation Metric for Image Captioning

Kazuki Matsuda, Yuiga Wada, Komei Sugiura

TL;DR

This work proposes DENEB, a novel supervised automatic evaluation metric specifically robust against hallucinations that incorporates the Sim-Vec Transformer, a mechanism that processes multiple references simultaneously, thereby efficiently capturing the similarity between an image, a candidate caption, and reference captions.

Abstract

In this work, we address the challenge of developing automatic evaluation metrics for image captioning, with a particular focus on robustness against hallucinations. Existing metrics are often inadequate for handling hallucinations, primarily due to their limited ability to compare candidate captions with multifaceted reference captions. To address this shortcoming, we propose DENEB, a novel supervised automatic evaluation metric specifically robust against hallucinations. DENEB incorporates the Sim-Vec Transformer, a mechanism that processes multiple references simultaneously, thereby efficiently capturing the similarity between an image, a candidate caption, and reference captions. To train DENEB, we construct the diverse and balanced Nebula dataset comprising 32,978 images, paired with human judgments provided by 805 annotators. We demonstrated that DENEB achieves state-of-the-art performance among existing LLM-free metrics on the FOIL, Composite, Flickr8K-Expert, Flickr8K-CF, Nebula, and PASCAL-50S datasets, validating its effectiveness and robustness against hallucinations.

Paper Structure

This paper contains 29 sections, 8 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Overview of Deneb. Our metric is designed to effectively evaluate hallucinated captions, which is crucial in scenarios where 'AI safety' is paramount. Unlike existing metrics such as CIDEr and Polos, which often fail to distinguish between correct and hallucinated captions, Deneb demonstrates improved robustness by assigning lower scores to hallucinated captions than to correct captions.
  • Figure 2: The architecture of Deneb. CLIP and RoBERTa are employed to extract embeddings from an image, a candidate, and references. These embeddings are then processed concurrently by the Sim-Vec Transformer, which comprises two modules: Sim-Vec Extraction (SVE) and a transformer. SVE uses the Hadamard product and element-wise differences to extract features that capture the similarity among $\bm{x}_\mathrm{img}$, $\bm{x}_\mathrm{cand}$, and $\{\bm{x}_\mathrm{ref}^{(i)}\}_{i=1}^N$.
  • Figure 3: Qualitative results on the Nebula dataset. Panels (a) and (b) illustrate successful cases, and panel (c) depicts a failure case.
  • Figure A: Additional qualitative examples from the Nebula dataset. Existing metrics, such as CIDEr, CLIP-S, and Polos, do not closely align with human evaluations. Specifically, these methods tend to overestimate the quality of instances where $\bm{x}_\mathrm{cand}$ is inappropriate but contains words related to the image. In contrast, Deneb appropriately assigns lower scores to these instances, thereby reflecting their quality more accurately.
  • Figure B: Additional qualitative examples from the FOIL dataset. $\bm{x}_\mathrm{orig}$ and $\bm{x}_\mathrm{foil}$ denote the correct candidate and the hallucinated candidate, respectively. Original words are highlighted in green, and hallucinated words are in purple. Notably, Deneb consistently assigns lower evaluation scores to hallucinated captions than to correct captions, thereby showcasing its robustness against hallucination.
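
The Figure 2 caption says that Sim-Vec Extraction builds its features from a Hadamard product and element-wise differences. The snippet below is a minimal sketch of that idea only, not the authors' implementation. The function names, the use of the absolute difference, the concatenation order, and the pairing of the candidate with the image and with each reference are all assumptions made for illustration; the embeddings are plain Python lists rather than CLIP or RoBERTa outputs.

```python
from typing import List, Sequence

Vector = Sequence[float]


def hadamard(a: Vector, b: Vector) -> List[float]:
    """Element-wise (Hadamard) product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [x * y for x, y in zip(a, b)]


def abs_difference(a: Vector, b: Vector) -> List[float]:
    """Element-wise absolute difference (the absolute value is an assumption)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [abs(x - y) for x, y in zip(a, b)]


def sim_vec_features(
    x_cand: Vector, x_img: Vector, x_refs: Sequence[Vector]
) -> List[List[float]]:
    """Hypothetical helper: one similarity vector per (candidate, other) pair.

    The candidate is paired first with the image embedding and then with
    each reference embedding. Each similarity vector is the concatenation
    of the Hadamard product and the absolute difference.
    """
    features = []
    for other in [x_img, *x_refs]:
        features.append(hadamard(x_cand, other) + abs_difference(x_cand, other))
    return features


# Toy 2-dimensional embeddings, chosen only to make the arithmetic visible.
toy_features = sim_vec_features(
    x_cand=[1.0, 2.0],
    x_img=[3.0, 4.0],
    x_refs=[[1.0, 0.0]],
)
```

In this sketch all N references yield their own similarity vector, so a downstream transformer could attend over them jointly, which matches the caption's description of processing the references concurrently.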