RePOPE: Impact of Annotation Errors on the POPE Benchmark
Yannic Neuhaus, Matthias Hein
TL;DR
This work investigates how annotation errors in MSCOCO affect the POPE object-hallucination benchmark. It introduces RePOPE, a corrected label set obtained by re-annotating 500 images and labeling each object as $Yes$, $No$, or $Ambiguous$; ambiguous cases are removed. Evaluating a diverse set of open-weight vision-language models on both POPE and RePOPE reveals substantial shifts in $F_1$-based rankings after relabeling, driven by changes in the counts of true positives ($TP$) and false positives ($FP$). The findings underscore the critical role of label quality in benchmarking, provide a more robust evaluation resource, and motivate the use of complementary benchmarks.
Abstract
Since data annotation is costly, benchmark datasets often incorporate labels from established image datasets. In this work, we assess the impact of label errors in MSCOCO on the frequently used object hallucination benchmark POPE. We re-annotate the benchmark images and identify an imbalance in annotation errors across different subsets. Evaluating multiple models on the revised labels, which we denote as RePOPE, we observe notable shifts in model rankings, highlighting the impact of label quality. Code and data are available at https://github.com/YanNeu/RePOPE .
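To make the effect of relabeling on $F_1$ concrete, the following is a minimal Python sketch, not the authors' evaluation code. The function name `f1_after_relabeling`, the `(prediction, label)` record format, and the toy data are all hypothetical and chosen only for illustration. It assumes the standard definition $F_1 = 2\,TP / (2\,TP + FP + FN)$ and drops any item labeled ambiguous before counting.

```python
from typing import Iterable, Tuple


def f1_after_relabeling(
    records: Iterable[Tuple[str, str]],
) -> Tuple[int, int, int, float]:
    """Count TP, FP, FN and compute F1 for yes/no answers.

    Each record is a hypothetical (prediction, label) pair, where the
    prediction is "yes" or "no" and the label is "yes", "no", or
    "ambiguous". Ambiguous items are skipped entirely.
    """
    tp = fp = fn = 0
    for prediction, label in records:
        if label == "ambiguous":
            continue  # removed from the evaluation set
        if prediction == "yes" and label == "yes":
            tp += 1
        elif prediction == "yes" and label == "no":
            fp += 1
        elif prediction == "no" and label == "yes":
            fn += 1
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return tp, fp, fn, f1


# Toy data (invented): the same four model answers scored against
# two different label sets.
original_labels = [("yes", "yes"), ("yes", "no"), ("no", "yes"), ("no", "no")]
# After a hypothetical re-annotation, the second label flips to "yes"
# and the third item is marked ambiguous.
revised_labels = [("yes", "yes"), ("yes", "yes"), ("no", "ambiguous"), ("no", "no")]

print(f1_after_relabeling(original_labels))
print(f1_after_relabeling(revised_labels))
```

In this toy case a single corrected label turns a false positive into a true positive, which is the kind of shift in $TP$ and $FP$ that can reorder models with similar scores.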
