CytoCrowd: A Multi-Annotator Benchmark Dataset for Cytology Image Analysis

Yonghao Si; Xingyuan Zeng; Zhao Chen; Libin Zheng; Caleb Chen Cao; Lei Chen; Jian Yin

TL;DR

CytoCrowd tackles the gap between single-ground-truth medical datasets and multi-annotator collections by offering 446 cytology images with 14,579 raw annotations from 4 pathologists and a senior-expert gold standard comprising 6,402 objects across 34 classes. It enables simultaneous evaluation of standard computer vision tasks using the gold GT and annotation aggregation algorithms using the raw disagreements. Baseline experiments show that simple aggregation (Majority Voting) can outperform more complex models on expert data, while domain-focused segmentation models like DeepEdit and Anytime achieve strong localization when trained against the GT, underscoring the need for specialized approaches in cytology. Overall, CytoCrowd is poised to drive development of methods that robustly handle inter-observer variability and uncertainty in medical image analysis.

Abstract

High-quality annotated datasets are crucial for advancing machine learning in medical image analysis. However, a critical gap exists: most datasets either offer a single, clean ground truth, which hides real-world expert disagreement, or they provide multiple annotations without a separate gold standard for objective evaluation. To bridge this gap, we introduce CytoCrowd, a new public benchmark for cytology analysis. The dataset features 446 high-resolution images, each with two key components: (1) raw, conflicting annotations from four independent pathologists, and (2) a separate, high-quality gold-standard ground truth established by a senior expert. This dual structure makes CytoCrowd a versatile resource. It serves as a benchmark for standard computer vision tasks, such as object detection and classification, using the ground truth. Simultaneously, it provides a realistic testbed for evaluating annotation aggregation algorithms that must resolve expert disagreements. We provide comprehensive baseline results for both tasks. Our experiments demonstrate the challenges presented by CytoCrowd and establish its value as a resource for developing the next generation of models for medical image analysis.
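To make the aggregation task concrete, here is a minimal sketch of Majority Voting over class labels. It assumes each object has already been matched across annotators, a step this summary does not describe. The object identifiers, class names, and tie-handling rule are hypothetical illustrations, not the dataset's actual schema or the authors' implementation.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(labels: Sequence[Optional[Hashable]]) -> Optional[Hashable]:
    """Return the most frequent non-missing label, or None on a tie or no votes."""
    # None marks an annotator who did not label this object (assumed convention).
    votes = Counter(label for label in labels if label is not None)
    if not votes:
        return None
    ranked = votes.most_common(2)
    # A tie between the top two labels is left unresolved in this sketch.
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


# Hypothetical labels from four annotators for three already-matched objects.
raw_labels = {
    "object_1": ["class_a", "class_a", "class_b", "class_a"],
    "object_2": ["class_a", "class_b", "class_b", None],
    "object_3": ["class_a", "class_a", "class_b", "class_b"],
}
aggregated = {obj: majority_vote(votes) for obj, votes in raw_labels.items()}
```

An aggregated label like this can then be compared against the senior expert's gold-standard label for the same object, which is the kind of evaluation the dual structure is meant to allow.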

Paper Structure (14 sections, 1 figure, 4 tables)

This paper contains 14 sections, 1 figure, 4 tables.

Introduction
Related Work
The CytoCrowd Dataset
Expert Disagreement Analysis
Experiments and Baselines
Task Definition
Evaluation Metrics
Baseline Methods
Annotation Aggregation Baselines
Learning-based Baselines
Performance Analysis
Performance of Annotation Aggregation Methods
Performance of Learning-based Methods
Conclusion

Figures (1)

Figure 1: Raw expert annotations (left) vs. the final gold-standard ground truth (right) on a sample image from CytoCrowd.
