dopanim: A Dataset of Doppelganger Animals with Noisy Annotations from Multiple Humans

Marek Herde, Denis Huseljic, Lukas Rauch, Bernhard Sick

TL;DR

This work introduces dopanim, a dataset of about 15,731 animal images across 15 classes from four doppelganger groups, with ground-truth labels from iNaturalist. An annotation campaign with 20 annotators adds soft, probabilistic labels, annotator metadata, and per-annotation times. The dataset supports the evaluation of multi-annotator learning methods across seven dataset variants; the results indicate that modeling annotator performance and leveraging probabilistic labels improve robustness to noisy annotations, with annot-mix and related methods achieving top performance. The authors also demonstrate uses beyond hard labels, including annotator metadata and annotation times in active learning, and release an open-source codebase and the data to support reproducible research. Despite its small scale, dopanim offers a versatile benchmark for studying learning from noisy, human-generated annotations and motivates future expansion to larger, more diverse datasets and broader methodological exploration.

Abstract

Human annotators typically provide annotated data for training machine learning models, such as neural networks. Yet, human annotations are subject to noise, impairing generalization performance. Methodological research on approaches counteracting noisy annotations requires corresponding datasets for a meaningful empirical evaluation. Consequently, we introduce a novel benchmark dataset, dopanim, consisting of about 15,750 animal images of 15 classes with ground truth labels. For approximately 10,500 of these images, 20 humans provided over 52,000 annotations with an accuracy of circa 67%. Its key attributes include (1) the challenging task of classifying doppelganger animals, (2) human-estimated likelihoods as annotations, and (3) annotator metadata. We benchmark well-known multi-annotator learning approaches using seven variants of this dataset and outline further evaluation use cases such as learning beyond hard class labels and active learning. Our dataset and a comprehensive codebase are publicly available to emulate the data collection process and to reproduce all empirical results.

Paper Structure (33 sections, 17 figures, 11 tables)

This paper contains 33 sections, 17 figures, 11 tables.

Figures (17)

  • Figure 1: Simplified illustration of the data types included in dopanim -- Two of three annotators provide probabilistic labels (after normalization) to identify the animal in the image. In addition to these annotations, annotation times and annotator metadata, e.g., interest in zoology, are available.
  • Figure 2: $t$-SNE of validation images' features from a DINOv2 ViT-S/14 fine-tuned on dopanim.
  • Figure 3: Annotation interface -- Annotators adjust sliders for different classes to set their label likelihoods. These slider values represent the relative likelihood of an image belonging to a specific class compared to others. The label likelihoods' absolute values are unimportant; only their comparison matters. A label likelihood of zero indicates certainty that the image does not belong to that class. If there is uncertainty about the ground truth class, non-zero likelihoods can be set for multiple classes.
  • Figure 4: Confusion matrix across all human top-label predictions.
  • Figure 5: Mean ranks ($\downarrow$) across the seven dataset variants.
  • ...and 12 more figures
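
The captions of Figures 1 and 3 describe annotators setting relative label likelihoods with sliders, which become probabilistic labels after normalization. The following is a minimal sketch of one way such a normalization could work, not the authors' implementation. The function names, the example slider values, and the uniform fallback for an all-zero annotation are assumptions made for illustration.

```python
import numpy as np


def normalize_likelihoods(likelihoods):
    """Turn raw slider values (one row per annotation) into probabilistic labels.

    Only the ratios between slider values matter, so each row is divided
    by its sum. Hypothetical helper, not part of the dopanim codebase.
    """
    likelihoods = np.asarray(likelihoods, dtype=float)
    if np.any(likelihoods < 0):
        raise ValueError("slider values must be non-negative")
    totals = likelihoods.sum(axis=1, keepdims=True)
    n_classes = likelihoods.shape[1]
    # Assumption: an all-zero row carries no preference, so use a uniform label.
    uniform = np.full_like(likelihoods, 1.0 / n_classes)
    safe_totals = np.where(totals > 0, totals, 1.0)  # avoid division by zero
    return np.where(totals > 0, likelihoods / safe_totals, uniform)


def top_labels(probabilities):
    """Return the index of the most likely class per annotation.

    Ties are resolved in favor of the lowest class index (np.argmax behavior).
    """
    return np.argmax(probabilities, axis=1)


# Illustrative slider values for three annotations over three classes.
sliders = [[80, 20, 0], [5, 5, 0], [0, 0, 0]]
probs = normalize_likelihoods(sliders)
print(probs)
print(top_labels(probs))
```

Top-label predictions obtained this way are the kind of hard labels that a confusion matrix such as the one in Figure 4 summarizes, while the normalized rows keep the annotator's uncertainty for methods that learn from soft labels.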