CytoCrowd: A Multi-Annotator Benchmark Dataset for Cytology Image Analysis

Yonghao Si, Xingyuan Zeng, Zhao Chen, Libin Zheng, Caleb Chen Cao, Lei Chen, Jian Yin

TL;DR

CytoCrowd tackles the gap between single-ground-truth medical datasets and multi-annotator collections by offering 446 cytology images with 14,579 raw annotations from four pathologists and a senior-expert gold standard comprising 6,402 objects across 34 classes. It enables simultaneous evaluation of standard computer vision tasks using the gold GT and annotation aggregation algorithms using the raw disagreements. Baseline experiments show that simple aggregation (Majority Voting) can outperform more complex models on expert data, while domain-focused segmentation models like DeepEdit and Anytime achieve strong localization when trained against the GT, underscoring the need for specialized approaches in cytology. Overall, CytoCrowd is poised to drive development of methods that robustly handle inter-observer variability and uncertainty in medical image analysis.

Abstract

High-quality annotated datasets are crucial for advancing machine learning in medical image analysis. However, a critical gap exists: most datasets either offer a single, clean ground truth, which hides real-world expert disagreement, or they provide multiple annotations without a separate gold standard for objective evaluation. To bridge this gap, we introduce CytoCrowd, a new public benchmark for cytology analysis. The dataset features 446 high-resolution images, each with two key components: (1) raw, conflicting annotations from four independent pathologists, and (2) a separate, high-quality gold-standard ground truth established by a senior expert. This dual structure makes CytoCrowd a versatile resource. It serves as a benchmark for standard computer vision tasks, such as object detection and classification, using the ground truth. Simultaneously, it provides a realistic testbed for evaluating annotation aggregation algorithms that must resolve expert disagreements. We provide comprehensive baseline results for both tasks. Our experiments demonstrate the challenges presented by CytoCrowd and establish its value as a resource for developing the next generation of models for medical image analysis.
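
To make the aggregation task concrete, the following is a minimal sketch of label-level majority voting scored against a gold standard. It is not the paper's implementation: it assumes the annotators' marks have already been matched to the same object, and the class names, the tie-handling rule (abstain), and the function names `majority_vote` and `accuracy` are all hypothetical choices made for illustration.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(labels: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent label, or None if there is no unique winner.

    `None` entries mean the annotator did not label this object.
    Abstaining on ties is an assumption of this sketch, not a rule from the paper.
    """
    votes = Counter(label for label in labels if label is not None)
    if not votes:
        return None
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the top two labels
    return ranked[0][0]


def accuracy(aggregated: Sequence[Optional[str]], gold: Sequence[str]) -> float:
    """Fraction of objects whose aggregated label equals the gold label."""
    if len(aggregated) != len(gold) or not gold:
        raise ValueError("aggregated and gold must be non-empty and equal length")
    correct = sum(a == g for a, g in zip(aggregated, gold))
    return correct / len(gold)


# Hypothetical toy data: one row per object, one column per pathologist.
raw_annotations = [
    ["lymphocyte", "lymphocyte", "macrophage", "lymphocyte"],
    ["macrophage", "neutrophil", "macrophage", None],
    ["neutrophil", "macrophage", "neutrophil", "macrophage"],
]
gold_labels = ["lymphocyte", "macrophage", "neutrophil"]

aggregated_labels = [majority_vote(row) for row in raw_annotations]
score = accuracy(aggregated_labels, gold_labels)
```

In this toy example the third object is a two-against-two split, so the aggregator abstains and that object counts as an error against the gold label. A real pipeline for this dataset would also need to decide how to match overlapping regions drawn by different pathologists before voting on their classes.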

Paper Structure (14 sections, 1 figure, 4 tables)