MultiOrg: A Multi-rater Organoid-detection Dataset

Christina Bukas, Harshavardhan Subramanian, Fenja See, Carina Steinchen, Ivan Ezhov, Gowtham Boosarpu, Sara Asgharpour, Gerald Burgstaller, Mareike Lehmann, Florian Kofler, Marie Piraud

TL;DR

MultiOrg introduces a large, openly available 2D organoid-detection dataset with multi-rater annotations to quantify label uncertainty in biomedical imaging. It provides over 400 high-resolution images and more than 60,000 bounding boxes, including three label sets for the test data, annotated by two experts at two time points, to study annotation noise. A COCO-formatted benchmark of four detectors (Faster R-CNN, SSD, YOLOv3, RTMDet) demonstrates the difficulty of the task and the trade-offs between models, with SSD delivering strong mAP@0.5 and showing resilience to label noise. The authors also release a Napari plugin for interactive quantification and curation, along with Kaggle/Zenodo resources to support reproducibility and uncertainty research. Overall, MultiOrg advances open datasets at the intersection of microscopy and uncertainty quantification, enabling robust, high-throughput organoid quantification and benchmarking across label variability.

Abstract

High-throughput image analysis in the biomedical domain has gained significant attention in recent years, driving advancements in drug discovery, disease prediction, and personalized medicine. Organoids, specifically, are an active area of research, providing excellent models for human organs and their functions. Automating the quantification of organoids in microscopy images would provide an effective solution to overcome substantial manual quantification bottlenecks, particularly in high-throughput image analysis. However, there is a notable lack of open biomedical datasets, in contrast to other domains, such as autonomous driving, and, notably, only a few of them have attempted to quantify annotation uncertainty. In this work, we present MultiOrg, a comprehensive organoid dataset tailored for object detection tasks with uncertainty quantification. This dataset comprises over 400 high-resolution 2D microscopy images and curated annotations of more than 60,000 organoids. Most importantly, it includes three label sets for the test data, independently annotated by two experts at distinct time points. We additionally provide a benchmark for organoid detection, and make the best model available through an easily installable, interactive plugin for the popular image visualization tool Napari, to perform organoid quantification.

Paper Structure

This paper contains 34 sections, 9 figures, 10 tables.

Figures (9)

  • Figure 1: MultiOrg workflow. a) Dataset creation, b) Multi-rater annotation at time points $t^0$ and $t^1$, c) Model benchmark, and d) Release on Kaggle and napari plugin
  • Figure 2: Multiple label sets in MultiOrg. Full test image (left) and crops of areas A, B, and C overlaid with $test^0$, $test^1_A$ and $test^1_B$ (right). The square crops are of sizes 1800, 1200, and 500 px. $test^0$ in images 4 and 16 (respectively 24 and 43) originates from Annotator A (resp. B). 'Macros' are typically noisier, as the cultures initially contain more cells (see the biological experimental setup section). We observe a reduction in the number of annotations at time $t^1$, as the annotators do not consider some small organoids that were annotated at $t^0$. In image 24, Annotator B annotates clumps of organoids as one large object at $t^1$. The large structure in image 43 is an experimental matrigel artifact. The image-wise intra-rater Recall scores are 0.776, 0.532, 0.667 and 0.503 for images 4, 16, 24, and 43, respectively (with $test^0$ as GT).
  • Figure 3: Multi-rater scores. Top: Intra-rater F1-score (left), Precision (middle), and Recall (right), where $test^0$ is considered the GT, for both annotators and according to study type. Annotator A appears more consistent on 'Normal' images (higher scores), and annotation of 'Macros' seems more challenging (with lower scores). Both annotators show an overall higher Precision and lower Recall, indicating that $test^0$ has many more annotations, which are treated here as FNs. Bottom: Inter-rater F1-score (left), Precision (middle), and Recall (right) on the test set between $test^1_A$ and $test^1_B$, where $test^1_A$ is considered the GT, split according to study type. Raters agree more on 'Normal' images, indicating that the annotation of 'Macros' images is more challenging. Each annotator generally differs less from their own earlier labels than the two annotators differ from each other (inter-rater scores are lower than intra-rater scores).
  • Figure 4: Model Benchmark. P-R curves using $test^0$ as the GT for all models (left) and using all three label sets for SSD (right). We observe that, overall, the SSD model predictions agree more closely with the annotations and offer a better trade-off between precision and recall. Although the model was trained and validated with labels from $t^0$, it agrees more closely with the annotations from time point $t^1$.
  • Figure A.5: Bounding box sizes. Box plots of the bounding box areas in $test^0$, $test^1_A$ and $test^1_B$, stratified by study type, on a logarithmic scale.
  • ...and 4 more figures
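
The intra- and inter-rater scores in Figure 3 compare two sets of bounding boxes while treating one of them as GT. The following is a minimal sketch of how such Precision, Recall, and F1 values could be computed. It is not the authors' evaluation code: the greedy one-to-one matching, the IoU threshold of 0.5, the function names `iou` and `match_scores`, and the toy boxes are all assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_scores(
    gt: List[Box], other: List[Box], iou_thr: float = 0.5
) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) of `other` against `gt`.

    Assumption: boxes are matched greedily one-to-one, highest IoU first,
    and a pair counts as a true positive only if IoU >= iou_thr.
    """
    pairs = sorted(
        ((iou(g, o), gi, oi) for gi, g in enumerate(gt) for oi, o in enumerate(other)),
        reverse=True,
    )
    used_gt, used_other = set(), set()
    tp = 0
    for score, gi, oi in pairs:
        if score < iou_thr:
            break  # remaining pairs overlap too little to match
        if gi in used_gt or oi in used_other:
            continue  # each box may be matched at most once
        used_gt.add(gi)
        used_other.add(oi)
        tp += 1
    fp = len(other) - tp  # boxes in `other` with no GT counterpart
    fn = len(gt) - tp     # GT boxes that `other` did not reproduce
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example (invented boxes): the earlier label set has one extra small box
# that is dropped in the later set, so it is counted as a false negative.
labels_t0 = [(0, 0, 10, 10), (20, 20, 30, 30), (40, 40, 50, 50)]
labels_t1 = [(0, 0, 10, 10), (21, 21, 30, 30)]
precision, recall, f1 = match_scores(labels_t0, labels_t1)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case, Precision is high and Recall is lower, which mirrors the pattern described in the Figure 3 caption: annotations present in $test^0$ but absent at $t^1$ are counted as FNs. Swapping which set is treated as GT swaps Precision and Recall, while F1 stays the same.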