Ambiguous Annotations: When is a Pedestrian not a Pedestrian?

Luisa Schwirten, Jannes Scholz, Daniel Kondermann, Janis Keuper

TL;DR

This paper tackles ambiguity in human annotations for autonomous driving data, focusing on pedestrian labeling, where the ground truth is not always observable. It introduces an ambiguity score based on annotator disagreement and demonstrates that pruning highly ambiguous training instances can improve detector performance (lower Log Average Miss Rate, higher precision and F1) while also reducing annotation costs. The study reveals a strong correlation between ambiguity and occlusion, which explains why removing ambiguous instances can help on non-occluded cases but may affect evaluations on occluded cases differently. Practical guidelines are proposed for diagnosing ambiguity, estimating it cost-efficiently, and pruning data to improve efficiency without compromising dataset representativeness. Extending the approach to more classes and measures is left to future work.

Abstract

Datasets labelled by human annotators are widely used in the training and testing of machine learning models. In recent years, researchers have paid increasing attention to label quality. However, it is not always possible to objectively determine whether an assigned label is correct or not. The present work investigates this ambiguity in the annotation of autonomous driving datasets as an important dimension of data quality. Our experiments show that excluding highly ambiguous data from training improves the performance of a state-of-the-art pedestrian detector in terms of LAMR, precision and F1 score, while also saving training time and annotation costs. Furthermore, we demonstrate that an understanding of the properties of the dataset and class under investigation is crucial for removing ambiguous instances safely and keeping the training data representative.

Paper Structure (16 sections, 2 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: Image Properties. Medium and high ambiguity here correspond to ambiguity measures of 0.4 to 0.49 and over 0.65, respectively. Examples from the ECP dataset (Braun et al., 2019).
  • Figure 3: Results for two training sets and three test sets including different degrees of ambiguity. "Original" denotes the original ECP training and validation sets; "Amb 0.65" and "Amb 0.5" denote the same subsets pruned above ambiguity thresholds of 0.65 and 0.5, respectively.
  • Figure 4: Comparison of recall, precision and F1 score for two different training and test datasets.
  • Figure 5: Distribution of occlusion and truncation tags for different ambiguity thresholds.
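
The threshold-based pruning behind the "Amb 0.65" and "Amb 0.5" subsets can be illustrated with a minimal sketch. It assumes that each annotated instance already carries a precomputed ambiguity score in [0, 1]; how that score is derived from annotator disagreement is not shown here. The `Annotation` class, the `prune_ambiguous` function, and the example scores are hypothetical and are not taken from the paper or from the ECP tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical record for one labelled instance."""
    instance_id: str
    ambiguity: float  # assumed precomputed ambiguity score in [0, 1]


def prune_ambiguous(annotations, threshold):
    """Drop every instance whose ambiguity score lies above the threshold."""
    return [a for a in annotations if a.ambiguity <= threshold]


# Made-up scores, for illustration only.
annotations = [
    Annotation("ped_001", 0.10),
    Annotation("ped_002", 0.45),
    Annotation("ped_003", 0.55),
    Annotation("ped_004", 0.70),
]

amb_065 = prune_ambiguous(annotations, 0.65)  # analogous to "Amb 0.65"
amb_05 = prune_ambiguous(annotations, 0.5)    # analogous to "Amb 0.5"

print([a.instance_id for a in amb_065])
print([a.instance_id for a in amb_05])
```

A lower threshold removes more instances, so checking how properties such as occlusion tags are distributed in the retained subset (as in Figure 5) is a reasonable step before training on it.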