Noisy Annotations in Semantic Segmentation
Moshe Kimhi, Omer Kerem, Eden Grad, Ehud Rivlin, Chaim Baskin
TL;DR
This work systematically investigates noisy annotations in semantic and instance segmentation. It introduces synthetic (VIPER-N) and real-world (COCO-N, Cityscapes-N) benchmarks, plus a weak-annotation benchmark (COCO-WAN) that simulates the noisy labels produced when foundation models are prompted with weak annotations. It defines five noise types, analyzes model robustness across architectures (including Transformer-based backbones), and shows substantial degradation in mask quality, boundary accuracy, and confidence under noise. The study also links segmentation noise to clinical risk via the ejection fraction (EF) metric on CAMUS, and provides qualitative analyses, ablations, and learning-with-noisy-labels (LNL) explorations, highlighting the need for noise-aware training, improved annotation pipelines, and robust architectures. Overall, the results underscore the gap between current LNL methods, which were designed primarily for classification, and the demands of spatially precise segmentation, motivating a toolkit (Benchmark-N) and future directions for resilient segmentation. The work concludes with practical recommendations and releases that enable reproducible evaluation of noisy-label robustness in real-world segmentation tasks.
Abstract
Obtaining accurate labels for instance segmentation is particularly challenging due to the complex nature of the task. Each image requires multiple annotations, covering not only each object's class but also its precise spatial boundaries. These requirements raise the likelihood of errors and inconsistencies in both manual and automated annotation processes. By simulating different noise conditions, we provide a realistic setting for assessing the robustness and generalization capabilities of instance segmentation models across segmentation tasks, and introduce two benchmarks, COCO-N and Cityscapes-N. We also propose a benchmark for weak annotation noise, dubbed COCO-WAN, which uses foundation models and weak annotations to simulate semi-automated annotation tools and their noisy labels. This study sheds light on the quality of segmentation masks produced by various models and challenges the efficacy of popular methods designed to address learning with label noise.
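To make the idea of simulating annotation noise concrete, the following is a minimal illustrative sketch, not the procedure used to build COCO-N, Cityscapes-N, or COCO-WAN. It assumes two generic kinds of corruption: a spatial perturbation that grows or shrinks an instance mask, and a class flip. The function names (`dilate`, `erode`, `perturb_instance`, `iou`) and the parameters `flip_prob` and `max_iter` are hypothetical and chosen for illustration only.

```python
import numpy as np


def dilate(mask: np.ndarray, iterations: int = 1) -> np.ndarray:
    """Binary dilation with a 4-connected (cross-shaped) neighbourhood."""
    out = mask.astype(bool)
    for _ in range(iterations):
        padded = np.pad(out, 1, mode="constant", constant_values=False)
        out = (
            padded[1:-1, 1:-1]
            | padded[:-2, 1:-1]   # neighbour above
            | padded[2:, 1:-1]    # neighbour below
            | padded[1:-1, :-2]   # neighbour to the left
            | padded[1:-1, 2:]    # neighbour to the right
        )
    return out


def erode(mask: np.ndarray, iterations: int = 1) -> np.ndarray:
    """Binary erosion, computed as dilation of the complement.

    Pixels outside the image count as foreground here, so a mask that
    touches the image border is not eroded from that side.
    """
    return ~dilate(~mask.astype(bool), iterations)


def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over union of two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 1.0


def perturb_instance(mask, label, num_classes, rng, flip_prob=0.2, max_iter=3):
    """Return a noisy (mask, label) pair for a single annotated instance.

    Assumed noise model (illustrative only): the mask is dilated or eroded
    by a random number of steps, and with probability `flip_prob` the label
    is replaced by a different class chosen uniformly at random.
    Requires num_classes >= 2.
    """
    steps = int(rng.integers(1, max_iter + 1))
    noisy_mask = dilate(mask, steps) if rng.random() < 0.5 else erode(mask, steps)
    noisy_label = label
    if rng.random() < flip_prob:
        # An offset in [1, num_classes - 1] guarantees a different class.
        noisy_label = int((label + rng.integers(1, num_classes)) % num_classes)
    return noisy_mask, noisy_label


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = np.zeros((16, 16), dtype=bool)
    clean[4:12, 4:12] = True  # an 8x8 square "object"
    noisy, noisy_label = perturb_instance(clean, label=3, num_classes=10, rng=rng)
    print("IoU between clean and noisy mask:", iou(clean, noisy))
    print("clean label: 3, noisy label:", noisy_label)
```

Measuring the IoU between clean and corrupted masks, as in the last lines, is one simple way to quantify how severe a given noise setting is before training on it.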
