Domain generalization across tumor types, laboratories, and species -- insights from the 2022 edition of the Mitosis Domain Generalization Challenge

Marc Aubreville; Nikolas Stathonikos; Taryn A. Donovan; Robert Klopfleisch; Jonathan Ganz; Jonas Ammeling; Frauke Wilm; Mitko Veta; Samir Jabari; Markus Eckstein; Jonas Annuscheit; Christian Krumnow; Engin Bozaba; Sercan Cayir; Hongyan Gu; Xiang 'Anthony' Chen; Mostafa Jahanifar; Adam Shephard; Satoshi Kondo; Satoshi Kasai; Sujatha Kotte; VG Saipradeep; Maxime W. Lafarge; Viktor H. Koelzer; Ziyue Wang; Yongbing Zhang; Sen Yang; Xiyue Wang; Katharina Breininger; Christof A. Bertram

TL;DR

The paper evaluates domain generalization for mitotic figure detection across tumor types, species, labs, and scanners via the MIDOG 2022 challenge. It demonstrates that modern deep learning detectors can achieve strong performance (top $F_1$ of $0.764$) across diverse domains, though characteristics absent from the training set (feline as a new species, spindle-cell morphology, and a new scanner) lead to small but significant decreases in performance. Ground-truth strategies include a three-expert H&E consensus and a PHH3-assisted reference, revealing that PHH3 can raise mitotic counts and improve label consistency while potentially conflicting with current grading schemes. Overall, the study shows promising cross-domain generalization for histopathology ML, highlights limitations of AP as a ranking metric, and suggests directions for richer datasets, full-slide annotations, and domain-generalization methods to better translate models to real-world pathology practice.

Abstract

Recognition of mitotic figures in histologic tumor specimens is highly relevant to patient outcome assessment. This task is challenging for algorithms and human experts alike, with deterioration of algorithmic performance under shifts in image representations. Considerable covariate shifts occur when assessment is performed on different tumor types, images are acquired using different digitization devices, or specimens are produced in different laboratories. This observation motivated the inception of the 2022 challenge on MItosis Domain Generalization (MIDOG 2022). The challenge provided annotated histologic tumor images from six different domains and evaluated the algorithmic approaches for mitotic figure detection provided by nine challenge participants on ten independent domains. Ground truth for mitotic figure detection was established in two ways: a three-expert consensus and an independent, immunohistochemistry-assisted set of labels. This work represents an overview of the challenge tasks, the algorithmic strategies employed by the participants, and potential factors contributing to their success. With an $F_1$ score of 0.764 for the top-performing team, we summarize that domain generalization across various tumor domains is possible with today's deep learning-based recognition pipelines. However, we also found that domain characteristics not present in the training set (feline as new species, spindle cell shape as new morphology and a new scanner) led to small but significant decreases in performance. When assessed against the immunohistochemistry-assisted reference standard, all methods resulted in reduced recall scores, but with only minor changes in the order of participants in the ranking.
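As a reading aid for the $F_1$ and recall figures quoted above, the following minimal sketch computes precision, recall, and $F_1$ from detection counts using the standard definitions. The function name and all counts are hypothetical and are not taken from the challenge; the challenge's own rule for matching detections to annotations is not described in this section and is not modeled here.

```python
def detection_scores(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall and F1 from true-positive, false-positive
    and false-negative detection counts (standard definitions)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only (not challenge results).
baseline = detection_scores(tp=80, fp=20, fn=20)

# Assumption: a reference standard that contains additional labelled
# mitotic figures which the detector misses adds false negatives.
more_reference_labels = detection_scores(tp=80, fp=20, fn=40)

print(baseline)
print(more_reference_labels)
```

In this toy setting, the extra reference labels lower recall while precision is unchanged. This is one way a reference standard with more labels can reduce recall; it is a simplification and not the analysis performed in the paper.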

Paper Structure

This paper contains 29 sections, 12 figures, and 7 tables.

Introduction
Material and evaluation methods
Challenge cohort and tumor domains
Establishment of ground truth
Dataset statistics
Reference approaches
Evaluation methods and metrics
Statistical analysis of the results
Overview of submitted methods
Pattern recognition tasks
Architectures
Ensembling and Test-Time Augmentation
Augmentation
Use of the unlabeled domain
Domain generalization methodologies
...and 14 more sections

Figures (12)

Figure 1: Random selection of crops of size $128\times 128$ px, centered around annotated MF from the six domains of the training set. Caption indicates the originating lab (UMCU = UMC Utrecht, VMU = University of Veterinary Medicine Vienna, FUB = FU Berlin) and the scanners (S360 = Hamamatsu S360, XR = Hamamatsu XR, CS2 = Aperio ScanScope CS2, 3DH = 3DHistech Pannoramic Scan II). Domain F was not labeled, hence the crops were selected at random.
Figure 2: Overview of the domains of the test set. Random cropouts sized $256\times 256$ px from four randomly selected images of each domain are shown. Caption indicates origin of tissue (UMCU = UMC Utrecht, UKER = University Hospital Erlangen, UKER NP = Institute of Neuropathology at University Hospital Erlangen, FUB = FU Berlin, VMU = University of Veterinary Medicine Vienna) and scanner (S360 = Hamamatsu S360, S60 = Hamamatsu S60, 3DH = 3DHistech Pannoramic Scan II). The tumor types are categorized by the tissue morphology into aggregated cell patterns, round cell morphology and spindle cell morphology.
Figure 3: Correspondence between hematoxylin and eosin (H&E)-stained tissue (top) and immunohistochemistry stain against phospho-histone H3 (PHH3, bottom). The left panel shows two tumor cells (green circles) with clear immunopositivity against PHH3 conclusive for MF, supporting the H&E morphology. The right panel shows a mitotic figure in telophase where the PHH3 stain is less conclusive, but the morphology in the H&E stain is characteristic.
Figure 4: Histogram of MF and NMF in the training set of MIDOG 2022.
Figure 5: Box-whisker plot of the distribution of MC across the domains of the preliminary test set and the final challenge test set. Boxes indicate lower and upper quartile values, colored lines indicate median values.
...and 7 more figures
