Quality Assured: Rethinking Annotation Strategies in Imaging AI

Tim Rädsch; Annika Reinke; Vivienn Weru; Minu D. Tizabi; Nicholas Heller; Fabian Isensee; Annette Kopp-Schneider; Lena Maier-Hein

TL;DR

This work evaluates high-quality reference annotation generation for imaging AI by comparing annotation providers (four companies vs MTurk) and dissecting internal QA versus improved labeling instructions. Using the HeiCo dataset with $57{,}648$ instance masks and $57{,}636$ metadata annotations, the study finds annotation companies outperform MTurk in quality and cost, while internal QA provides only marginal gains; stronger improvements come from richer labeling instructions. Moreover, the benefit of QA depends on image characteristics, with QA helping for challenging generalizable features but not for many domain-specific traits. Practically, the results suggest researchers should prioritize detailed labeling instructions and consider annotation companies for reliable, cost-effective reference data, particularly in safety-critical imaging contexts, while future work should explore annotator training and integration with foundation-model workflows.

Abstract

This paper does not describe a novel method. Instead, it studies an essential foundation for reliable benchmarking and ultimately real-world application of AI-based image analysis: generating high-quality reference annotations. Previous research has focused on crowdsourcing as a means of outsourcing annotations. However, little attention has so far been given to annotation companies, specifically regarding their internal quality assurance (QA) processes. Therefore, our aim is to evaluate the influence of QA employed by annotation companies on annotation quality and devise methodologies for maximizing data annotation efficacy. Based on a total of 57,648 instance segmented images obtained from a total of 924 annotators and 34 QA workers from four annotation companies and Amazon Mechanical Turk (MTurk), we derived the following insights: (1) Annotation companies perform better both in terms of quantity and quality compared to the widely used platform MTurk. (2) Annotation companies' internal QA only provides marginal improvements, if any. However, improving labeling instructions instead of investing in QA can substantially boost annotation performance. (3) The benefit of internal QA depends on specific image characteristics. Our work could enable researchers to derive substantially more value from a fixed annotation budget and change the way annotation companies conduct internal QA.

Paper Structure (24 sections, 1 equation, 9 figures, 2 tables)

Introduction
Related Work
Comparison of annotation providers
Quality assurance
Impact of real-world image characteristics on annotation quality
Methods
Data
Labeling instructions
Annotation providers
Quality assurance process
Experimental design
Results
How does the choice of annotation provider influence the quality and cost of annotations?
Does annotation companies' quality assurance improve annotation quality?
Should quality assurance focus on specific subsets of images?
...and 9 more sections

Figures (9)

Figure 1: Real-world image characteristics examples. During recording, a broad range of (a) generalizable and (b) domain-specific image characteristics occur and can influence humans and models alike.
Figure 2: a) In contrast to Amazon Mechanical Turk (MTurk), annotation companies commonly perform quality assurance (QA) before delivering the result to the annotation requester. b) Annotation providers are an integral part of the computer vision community. They account for around 20% of exhibitors present at major computer vision conferences (ECCV 2022: n = 40, CVPR 2022: n = 97, ICCV 2023: n = 43, CVPR 2023: n = 104) in the last two years.
Figure 2: Only MTurk crowdworkers generated spam annotations. The annotation companies generated no spam annotations, whereas around 20% of the collected MTurk annotations were spam. The spam annotations vary greatly in their style and seem to display different strategies to game the MTurk annotation tasks.
Figure 3: Annotation companies are more efficient at generating high-quality images than Amazon Mechanical Turk (MTurk). Percentage of images without severe errors (green), images with severe errors (orange), and spam (red) obtained for the five different annotation providers investigated. Annotation costs per image were lower for all companies compared to MTurk (median: 61.1%).
Figure 3: The percentage of modifications by QA workers differed greatly between the companies. However, this did not necessarily result in higher annotation quality compared to other annotation companies. a, The percentages of modified and unmodified images by QA workers. An image counts as modified if the responsible QA worker adapted the instance segmentation mask received from the annotator. b, The Dice similarity coefficient (DSC) was computed for each annotated image and is displayed for each pair of annotation company and labeling instruction as a dot and box plot (the band indicates the median, the box indicates the first and third quartiles, and the whiskers indicate $\pm 1.5 \times$ interquartile range). The DSC maximum is 1 and the minimum is 0 for each image.
...and 4 more figures
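
The figure captions above report annotation quality as a per-image DSC bounded between 0 and 1. As a rough illustration of that metric, the following minimal sketch computes the Dice similarity coefficient for a single pair of binary masks. The function name `dice_similarity`, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions made for this example; how the paper aggregates the score across the instances in an image is not specified in this excerpt.

```python
from typing import Sequence


def dice_similarity(
    mask_a: Sequence[Sequence[int]],
    mask_b: Sequence[Sequence[int]],
) -> float:
    """Return DSC = 2|A ∩ B| / (|A| + |B|) for two binary masks of equal shape."""
    same_rows = len(mask_a) == len(mask_b)
    if not same_rows or any(len(ra) != len(rb) for ra, rb in zip(mask_a, mask_b)):
        raise ValueError("masks must have the same shape")

    intersection = 0
    size_a = 0
    size_b = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            a_on, b_on = bool(a), bool(b)
            size_a += a_on
            size_b += b_on
            intersection += a_on and b_on

    if size_a + size_b == 0:
        # Assumption: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * intersection / (size_a + size_b)


# Hypothetical 3x3 masks: reference annotation vs. a submitted annotation.
reference = [
    [0, 1, 1],
    [0, 1, 1],
    [0, 0, 0],
]
annotation = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 1],
]

print(dice_similarity(reference, annotation))  # 3 shared pixels, 4 + 4 total -> 0.75
```

A score of 1 means the two masks coincide exactly and 0 means they share no foreground pixels, which matches the bounds stated in the caption.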
