Label Convergence: Defining an Upper Performance Bound in Object Recognition through Contradictory Annotations

David Tschirschwitz; Volker Rodehorst

TL;DR

Label convergence defines an intrinsic upper bound on object recognition performance under unavoidable annotation variation. The paper develops a model-independent, data-centric framework that combines a modified mAP for two annotators with a regression based on Krippendorff's Alpha, which maps inter-annotator agreement to mAP across datasets, and applies bootstrapping on LVIS, TexBiG, and VinDr-CXR. It reports label convergence on LVIS in the interval $[62.63, 67.52]$ mAP, with Co-DETR near the upper bound, indicating that current model capacity is sufficient while label quality remains the limiting factor. The authors advocate updating evaluation practices, creating cleaner test data, and incorporating multi-annotated data to reveal annotation variation from the outset, shifting emphasis toward data-centric improvements and realistic evaluation under noise.
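As a rough illustration of the bootstrapping step mentioned above, the following Python sketch computes a percentile bootstrap interval by resampling images with replacement. It is a sketch under stated assumptions, not the authors' procedure: the function name `bootstrap_interval`, the per-image `scores`, and the use of a simple mean in place of the modified mAP are all hypothetical, and the demo values are synthetic.

```python
import random
import statistics


def bootstrap_interval(scores, metric=statistics.fmean, n_resamples=2000,
                       confidence=0.95, seed=0):
    """Percentile bootstrap interval for `metric` over resampled images."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample the images with replacement and recompute the metric each time.
    estimates = sorted(
        metric(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    # Cut off an equal share of the sorted estimates on both tails.
    tail = (1.0 - confidence) / 2.0
    lower_idx = max(0, round(tail * n_resamples))
    upper_idx = min(n_resamples - 1, round((1.0 - tail) * n_resamples) - 1)
    return estimates[lower_idx], estimates[upper_idx]


# Hypothetical per-image agreement scores (synthetic, not from the paper).
demo_rng = random.Random(42)
demo_scores = [demo_rng.uniform(0.4, 0.9) for _ in range(500)]
low, high = bootstrap_interval(demo_scores)
```

In a real evaluation, the `metric` argument would be replaced by a function that recomputes the dataset-level metric on each resampled set of images, since mAP is not a plain average of per-image values.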

Abstract

Annotation errors are a challenge not only during training of machine learning models, but also during their evaluation. Label variations and inaccuracies in datasets often manifest as contradictory examples that deviate from established labeling conventions. Such inconsistencies, when significant, prevent models from achieving optimal performance on metrics such as mean Average Precision (mAP). We introduce the notion of "label convergence" to describe the highest achievable performance under the constraint of contradictory test annotations, essentially defining an upper bound on model accuracy. Recognizing that noise is an inherent characteristic of all data, our study analyzes five real-world datasets, including the LVIS dataset, to investigate the phenomenon of label convergence. We approximate that label convergence lies between 62.63 and 67.52 mAP@[0.5:0.95:0.05] for LVIS with 95% confidence, attributing these bounds to the presence of real annotation errors. With current state-of-the-art (SOTA) models at the upper end of the label convergence interval for the well-studied LVIS dataset, we conclude that model capacity is sufficient to solve current object detection problems. Therefore, future efforts should focus on three key aspects: (1) updating the problem specification and adjusting evaluation practices to account for unavoidable label noise, (2) creating cleaner data, especially test data, and (3) including multi-annotated data to investigate annotation variation and make these issues visible from the outset.
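The notation mAP@[0.5:0.95:0.05] refers to averaging over IoU thresholds from 0.5 to 0.95 in steps of 0.05. The Python sketch below shows how a single pair of boxes from two annotators can count as a match at lenient thresholds and as a contradiction at strict ones. The helper names and the two example boxes are hypothetical and chosen only for illustration; the paper's modified mAP involves more than this per-pair check.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Ten thresholds: 0.50, 0.55, ..., 0.95 (built from integers to avoid float drift).
IOU_THRESHOLDS = [t / 100 for t in range(50, 100, 5)]


def matched_thresholds(box_a, box_b):
    """Thresholds at which the two boxes would still count as the same instance."""
    iou = box_iou(box_a, box_b)
    return [t for t in IOU_THRESHOLDS if iou >= t]


# Hypothetical boxes: coder B draws the same object slightly shorter than coder A.
coder_a_box = (0.0, 0.0, 10.0, 10.0)
coder_b_box = (0.0, 0.0, 10.0, 8.2)
agreeing = matched_thresholds(coder_a_box, coder_b_box)
```

With these example boxes, the pair agrees at the lower thresholds and disagrees at the higher ones, which is why small differences in box extent reduce the attainable score under the averaged metric.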

Paper Structure (10 sections, 10 figures, 3 tables)

Introduction
Related Work
Analyzed Dataset
Determining Label Convergence
Two Annotators per Image
Considerations for Extrapolation
Multiple Annotators per Image
Annotation Variation Type Analysis
Conclusion
Outlook

Figures (10)

Figure 1: Illustration of the convergence threshold intervals and the respective state-of-the-art (SOTA) results, highlighting how close the data points are to the upper performance bound -- at least for LVIS. For the LVIS dataset, the convergence threshold and SOTA results (using pre-trained Co-DETR \cite{co_detr}) are evaluated on the consistency subset of LVIS. The convergence threshold interval for LVIS is created directly using modified mAP@[0.5,0.95,0.05] as described in Section \ref{sec:two-annos}. For the TexBiG and VinDr-CXR datasets, the convergence threshold interval is inferred using the formula from K-$\alpha$ to mAP introduced in Section \ref{sec:mul-annos}. The TexBiG and VinDr-CXR datasets utilize their respective leaderboard results, with TexBiG using modified mAP@[0.5,0.95,0.05] and VinDr-CXR using the mAP@0.4 metric as used in their respective leaderboards.
Figure 2: Causes of human label variation and annotation errors, which lead to five different types of variation. Our graphic is inspired by Plank \cite{plank2022}. The fifth error type is illustrated between the green and red radiologists.
Figure 3: Best viewed digitally. Linear least-squares regression fit with K-$\alpha$ as the independent and mAP as the dependent variable. The scatter plots only show a fraction of the data points.
Figure 4: Quantitative results of the variation analysis on object detection, showing the ratio of correctly matched instances compared to unmatched instances. The most common variation types are 2 and 3 (Figure \ref{fig:label_variation_types}), which increase significantly as instances become harder to match due to higher IoU thresholds.
Figure 5: Example image from the LVIS Consistency Subset showing some of the variations. Only a selection of annotations from the cupboard class is visualized. The dotted line indicates coder A, while the dashed line indicates coder B. (a) At the top left and top right, two cases of a merging issue are visible. This kind of inconsistency should be prevented by an unambiguous guideline. While coder A separated the yellow and green areas of the two cupboards at the cupboard doors, coder B combined them as indicated by the magenta bounding boxes. (b) The two orange areas indicate an additional instance at the top, where a window is interpreted as a cupboard, while the bottom instance was found by coder A but missed by coder B. We attribute the first variation to image quality and the second one to an annotation error. (c) At the bottom left, another case of different labeling conventions is visible, where coder A covers the entire height of the cupboard, including the drawer, while coder B excludes the drawer from the cupboard. This could be due to either ambiguities in the guidelines or the skill of the annotator.
...and 5 more figures
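
Figure 3 describes a linear least-squares regression with K-$\alpha$ as the independent variable and mAP as the dependent variable. The Python sketch below shows what such a fit and the resulting mapping could look like. The function names `fit_line` and `predict_map` and all data points are hypothetical and synthetic, so the fitted coefficients say nothing about the formula reported in the paper.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_map(k_alpha, slope, intercept):
    """Map an agreement value to an estimated mAP using the fitted line."""
    return slope * k_alpha + intercept


# Synthetic (K-alpha, mAP) pairs for illustration only; not values from the paper.
k_alpha_values = [0.2, 0.4, 0.6, 0.8]
map_values = [25.0, 45.0, 65.0, 85.0]
slope, intercept = fit_line(k_alpha_values, map_values)
estimated_map = predict_map(0.5, slope, intercept)
```

A fit of this kind is what would allow an agreement score measured on a multi-annotated dataset to be translated into an approximate mAP ceiling when a direct two-annotator mAP is not available.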
