dopanim: A Dataset of Doppelganger Animals with Noisy Annotations from Multiple Humans
Marek Herde, Denis Huseljic, Lukas Rauch, Bernhard Sick
TL;DR
This work introduces dopanim, a dataset of about 15,731 animal images across 15 classes from four doppelganger groups, with ground-truth labels from iNaturalist and a rich annotation campaign providing soft, probabilistic labels from 20 annotators, plus annotator metadata and per-annotation times. It enables rigorous evaluation of multi-annotator learning methods across seven dataset variants, highlighting that modeling annotator performance and leveraging probabilistic labels improves robustness to noisy annotations, with annot-mix and related methods achieving top performance. The authors also demonstrate practical uses beyond hard labels, including leveraging annotator metadata and annotation times in active learning, and provide a full open-source codebase and data releases to support reproducible research. Despite its small scale, dopanim offers a versatile benchmark for studying learning from noisy, human-generated annotations and motivates future expansions to larger, more diverse datasets and broader methodological explorations.
Abstract
Human annotators typically provide annotated data for training machine learning models, such as neural networks. Yet, human annotations are subject to noise, impairing generalization performance. Methodological research on approaches counteracting noisy annotations requires corresponding datasets for a meaningful empirical evaluation. Consequently, we introduce a novel benchmark dataset, dopanim, consisting of about 15,750 animal images of 15 classes with ground truth labels. For approximately 10,500 of these images, 20 humans provided over 52,000 annotations with an accuracy of approximately 67%. Its key attributes include (1) the challenging task of classifying doppelganger animals, (2) human-estimated likelihoods as annotations, and (3) annotator metadata. We benchmark well-known multi-annotator learning approaches using seven variants of this dataset and outline further evaluation use cases such as learning beyond hard class labels and active learning. Our dataset and a comprehensive codebase are publicly available to emulate the data collection process and to reproduce all empirical results.
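To make the distinction between hard labels and human-estimated likelihoods concrete, the following is a minimal sketch of two simple aggregation baselines: majority voting over each annotator's top class, and averaging the annotators' likelihood vectors. It is not the dataset's API or any of the benchmarked methods. The array layout, the function names `aggregate_soft_labels` and `majority_vote`, and the toy values are all assumptions made for illustration.

```python
import numpy as np


def aggregate_soft_labels(likelihoods, mask):
    """Average the likelihood vectors of the annotators who labeled each image.

    likelihoods: float array, shape (n_images, n_annotators, n_classes)
    mask: bool array, shape (n_images, n_annotators); True where an
          annotation exists (hypothetical layout, not the dataset format).
    Images without any annotation fall back to a uniform distribution.
    """
    weights = mask[..., None].astype(float)
    summed = (likelihoods * weights).sum(axis=1)   # (n_images, n_classes)
    counts = weights.sum(axis=1)                   # (n_images, 1)
    n_classes = likelihoods.shape[-1]
    uniform = np.full(n_classes, 1.0 / n_classes)
    return np.where(counts > 0, summed / np.maximum(counts, 1.0), uniform)


def majority_vote(likelihoods, mask):
    """Reduce each annotation to its top class, then take the most voted class."""
    top_class = likelihoods.argmax(axis=-1)        # (n_images, n_annotators)
    n_classes = likelihoods.shape[-1]
    votes = np.eye(n_classes)[top_class] * mask[..., None]
    return votes.sum(axis=1).argmax(axis=-1)


# Toy example: 2 images, 3 annotators, 3 classes (values are made up).
likelihoods = np.array([
    [[0.60, 0.40, 0.0], [0.10, 0.90, 0.0], [0.55, 0.45, 0.0]],
    [[0.00, 0.00, 1.0], [0.00, 0.00, 0.0], [0.00, 0.00, 0.0]],
])
mask = np.array([
    [True, True, True],
    [True, False, False],   # only the first annotator labeled image 1
])

soft = aggregate_soft_labels(likelihoods, mask)
hard = majority_vote(likelihoods, mask)
```

In this toy case the two baselines disagree on the first image: two annotators weakly prefer class 0, while one is strongly confident in class 1, so the averaged likelihoods favor class 1 although the majority vote selects class 0. This is the kind of information that is lost when likelihood annotations are collapsed to hard labels.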
