RePOPE: Impact of Annotation Errors on the POPE Benchmark

Yannic Neuhaus, Matthias Hein

TL;DR

This work investigates how annotation noise in MSCOCO affects the POPE object-hallucination benchmark. It introduces RePOPE, a corrected label set obtained by re-annotating 500 images and labeling each queried object as "Yes", "No", or "Ambiguous"; ambiguous cases are removed from the benchmark. Evaluating a diverse set of open-weight vision-language models on both POPE and RePOPE reveals substantial shifts in $F_1$-based rankings after relabeling, driven by changes in the true positive ($TP$) and false positive ($FP$) counts. The findings underscore the importance of data quality in benchmarking. RePOPE provides a more robust evaluation resource, and the work calls for complementary benchmarks.

Abstract

Since data annotation is costly, benchmark datasets often incorporate labels from established image datasets. In this work, we assess the impact of label errors in MSCOCO on the frequently used object hallucination benchmark POPE. We re-annotate the benchmark images and identify an imbalance in annotation errors across different subsets. Evaluating multiple models on the revised labels, which we denote as RePOPE, we observe notable shifts in model rankings, highlighting the impact of label quality. Code and data are available at https://github.com/YanNeu/RePOPE .

Paper Structure

This paper contains 10 sections, 4 figures, 9 tables.

Figures (4)

  • Figure 1: RePOPE annotation examples. The first row displays images that do not contain the object but are incorrectly labeled as “Yes” in POPE. The second row shows images where the object is present but mistakenly labeled as “No”; the object's presence is highlighted with a red box. The third row illustrates cases of inconsistent labeling in POPE, which we annotate as “ambiguous.” Examples include a “teddy bear” being categorized as a bear, a motorcycle being considered a motorized bicycle, and airport vehicles being classified as cars. Since these categorizations are subjective and MSCOCO labels are inconsistent, we exclude such cases from the benchmark.
  • Figure 2: POPE vs. RePOPE. Due to the high error rate on the positive labels, the number of true positives (TP) is significantly reduced across all models. For false positives (FP), we observe different patterns across the three subsets: on the random subset, the number of FP almost doubles for most models; on popular, results are relatively stable; and on adversarial, we observe a slight reduction in FP. The ranking according to the F1 score is heavily impacted by the relabeling. The top models on the popular and adversarial splits for POPE (Ovis2-4B/-8B) also achieve the top ranks on random for RePOPE.
  • Figure 3: POPE vs. RePOPE: Precision and Recall. The models' precision decreases on RePOPE while the true positive rate (TPR) improves. Effects on the true negative rate (TNR) are small but sufficient to change the ranking. Because label errors are more frequent in the positive questions, the models' yes rates decrease on the relabeled dataset.
  • Figure 4: POPE vs. RePOPE: Accuracy. The pattern is similar to that for the F1 score. Note, however, that in RePOPE the numbers of positive and negative samples are no longer balanced, so accuracy needs to be interpreted with care.
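
The quantities named in the captions above (TP, FP, precision, TPR, TNR, F1, accuracy, yes rate) can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function name `evaluate`, the label strings, and the six toy answers and labels are all hypothetical, chosen only to show how dropping "ambiguous" questions and flipping a label changes the scores for the same model answers.

```python
def evaluate(answers, labels):
    """Score yes/no answers against labels, skipping 'ambiguous' questions.

    Hypothetical helper for illustration; answers are "yes"/"no",
    labels are "yes"/"no"/"ambiguous".
    """
    tp = fp = tn = fn = 0
    for answer, label in zip(answers, labels):
        if label == "ambiguous":
            continue  # ambiguous questions are excluded from the benchmark
        if answer == "yes" and label == "yes":
            tp += 1
        elif answer == "yes" and label == "no":
            fp += 1
        elif answer == "no" and label == "no":
            tn += 1
        else:
            fn += 1

    n = tp + fp + tn + fn

    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    return {
        "n": n,
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "precision": ratio(tp, tp + fp),
        "tpr": ratio(tp, tp + fn),            # recall / true positive rate
        "tnr": ratio(tn, tn + fp),            # true negative rate
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "accuracy": ratio(tp + tn, n),
        "yes_rate": ratio(tp + fp, n),        # fraction of "yes" answers
    }


# Toy data (made up): one model's answers scored against two label sets.
answers       = ["yes", "yes", "no", "no",  "yes",       "no"]
pope_labels   = ["yes", "yes", "no", "yes", "no",        "no"]
repope_labels = ["yes", "no",  "no", "yes", "ambiguous", "no"]

pope = evaluate(answers, pope_labels)
repope = evaluate(answers, repope_labels)

for name, scores in (("POPE", pope), ("RePOPE", repope)):
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

In this toy case the second question flips from "yes" to "no", which turns a true positive into a false positive, and the fifth question is dropped as ambiguous. The answers are unchanged, yet F1 and the yes rate both move, which is the mechanism the captions describe at benchmark scale.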