Expectation-Maximization as the Engine of Scalable Medical Intelligence

Wenxuan Li; Pedro R. A. S. Bassi; Tianyu Lin; Yu-Cheng Chou; Jakob Wasserthal; Xinze Zhou; Qi Chen; Fabian Isensee; Yannick Kirchhoff; Maximilian Rokuss; Saikat Roy; Constantin Ulrich; Klaus Maier-Hein; Szymon Płotka; Xiaoxi Chen; Kang Wang; Yang Yang; Daguang Xu; Kai Ding; Yucheng Tang; Alan L. Yuille; Zongwei Zhou

TL;DR

ScaleMAI reframes the classical EM algorithm for medical imaging by treating annotations as missing data and introducing an iterative loop in which a model critiques its own training data. The Expectation step uses automated tools (Label Verifier and Label Expert) to refine noisy labels, while the Maximization step retrains the model on a mix of unlabeled, synthetic, and selectively reviewed data, guided by ROC analysis that favors high sensitivity. The approach yields PanTS-XL, a 47,315-scan, 88-structure dataset with per-voxel annotations, and Flagship Model, which matches or surpasses human expert performance in tumor diagnosis and improves tumor detection and segmentation on multiple benchmarks. The framework confines expert review to fewer than 5% of annotations and demonstrates scalable, data-driven improvement of medical AI across large-scale CT datasets and diverse imaging conditions.
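
For reference, the classical EM iteration behind the "annotations as missing data" framing can be written in its textbook form. The symbols here are our assumption and may differ from the paper's own notation: $X$ denotes the observed CT scans, $Z$ the missing or unreliable annotations, and $\theta$ the model parameters.

```latex
\begin{align}
\text{E-step:}\quad & Q\big(\theta \mid \theta^{(t)}\big)
  = \mathbb{E}_{Z \sim p(Z \mid X,\, \theta^{(t)})}\big[\log p(X, Z \mid \theta)\big] \\
\text{M-step:}\quad & \theta^{(t+1)} = \arg\max_{\theta}\; Q\big(\theta \mid \theta^{(t)}\big)
\end{align}
```

ScaleMAI departs from this classical form by adding human review for the small share of annotations that neither step resolves.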

Abstract

Large, high-quality, annotated datasets are the foundation of medical AI research, but constructing even a small, moderate-quality, annotated dataset can take years of effort from multidisciplinary teams. Although active learning can prioritize what to annotate, scaling up still requires extensive manual effort to revise noisy annotations. We formulate this as a missing-data problem and develop ScaleMAI, a framework that couples data annotation and model development so that they co-evolve through an Expectation-Maximization (EM) process. In this iterative process, the AI model automatically identifies and corrects mistakes in the annotations (Expectation), while the refined annotated data are used to retrain the model and improve its accuracy (Maximization). Beyond the classical EM algorithm, ScaleMAI brings human experts into the loop to review the annotations (<5%) that cannot be adequately addressed by either the Expectation or the Maximization step. As a result, ScaleMAI progressively creates an annotated dataset of 47,315 CT scans (4.8x larger than the largest public dataset, PanTS), including 4,163,720 per-voxel annotations for benign/malignant tumors and 88 anatomical structures. ScaleMAI iteratively trains a model that exceeds human expert performance in tumor diagnosis (+7%) and outperforms models developed from smaller, moderate-quality datasets, with statistically significant gains in tumor detection (+10%) and segmentation (+14%) on two prestigious benchmarks.
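
The loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the 1-D two-centroid "model", the `margin` rule for automatic correction, the function name `refine_labels`, and the simulated expert (`oracle`) are all assumptions chosen to keep the example small. It only shows the pattern of using label/prediction disagreement to fix confident errors automatically while sending a capped share (5% here) of doubtful cases to review.

```python
import numpy as np

def fit_midpoint(x, labels):
    """Hypothetical 'model': the decision boundary between two class centroids."""
    mu0, mu1 = x[labels == 0].mean(), x[labels == 1].mean()
    return 0.5 * (mu0 + mu1), mu1 >= mu0

def refine_labels(x, labels, oracle, n_rounds=3, margin=1.0, review_budget=0.05):
    labels = labels.copy()
    locked = np.zeros(len(x), dtype=bool)      # labels confirmed by the simulated expert
    budget = int(review_budget * len(x))       # total expert reviews allowed
    midpoint, ascending = fit_midpoint(x, labels)
    for _ in range(n_rounds):
        # Expectation-like step: disagreement between model and labels marks
        # annotations as unreliable; confident disagreements are auto-corrected.
        pred = ((x > midpoint) == ascending).astype(int)
        confidence = np.abs(x - midpoint)
        disagree = (pred != labels) & ~locked
        auto = disagree & (confidence > margin)
        labels[auto] = pred[auto]

        # Maximization-like step: the least-confident disagreements go to the
        # expert (within the remaining budget), then the model is refit.
        doubtful = np.flatnonzero(disagree & ~auto)
        remaining = budget - int(locked.sum())
        doubtful = doubtful[np.argsort(confidence[doubtful])][:remaining]
        labels[doubtful] = oracle[doubtful]
        locked[doubtful] = True
        midpoint, ascending = fit_midpoint(x, labels)
    return labels, int(locked.sum())

# Synthetic data (an assumption for illustration): two Gaussian classes, 20% label noise.
rng = np.random.default_rng(0)
truth = np.repeat([0, 1], 1000)
x = rng.normal(loc=np.where(truth == 1, 2.0, -2.0), scale=1.0)
noisy = np.where(rng.random(truth.size) < 0.2, 1 - truth, truth)

refined, n_reviewed = refine_labels(x, noisy, oracle=truth)
acc_before = (noisy == truth).mean()
acc_after = (refined == truth).mean()
print(f"label accuracy: {acc_before:.3f} -> {acc_after:.3f}; "
      f"reviewed fraction: {n_reviewed / truth.size:.3f}")
```

Locking reviewed labels is a design choice of this sketch: it prevents later automatic corrections from overwriting an expert decision.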

Paper Structure (38 sections, 5 equations, 13 figures, 7 tables, 2 algorithms)

Introduction
ScaleMAI
The Expectation Step
Label Verifier
Label Expert
The Maximization Step
ROC Analysis
Continual Tuning
An Executable Summary
Contribution #1: PanTS-XL Dataset
Dataset Overview
Gold Standard vs. Silver Standard Annotation
Reader Study: Tumor Detection & Diagnosis
High Quality Anatomical Structure Annotation
Contribution #2: Flagship Model
...and 23 more sections

Figures (13)

Figure 1: ScaleMAI reimagines the classic Expectation–Maximization (EM) algorithm (Dempster et al., 1977) for the problem of building large, high-quality medical datasets when expert annotations are scarce and noisy. Instead of training a model on a fixed dataset and stopping there, we let the model and the dataset improve each other in a loop. At a high level, the model first "overfits" to the current dataset and then acts as a critic of that same dataset: wherever its predictions and the existing annotations disagree strongly, we treat this as missing or unreliable information. In the Expectation step, automatic tools (Label Verifier and Label Expert) use this disagreement to correct easy annotation errors and highlight only the most doubtful regions for human review. In the Maximization step, human experts focus on those few flagged cases, refine the annotations with the help of ROC-guided prioritization, and the model is retrained on this improved dataset using a mixture of unlabeled, synthetic, and selectively sampled scans. Repeating this cycle gradually turns a small, imperfect dataset into a large, expert-level resource, while keeping human effort concentrated on the <5% of annotations where the AI remains uncertain.
Figure 2: Label Expert selects higher-quality annotations across diverse anatomical structures. Evaluated on a validation set of 3,000 CT scans, Label Expert achieves 96.5% accuracy across all 88 classes. We report results on the organs at risk for pancreatic tumors. Label Expert consistently chooses the better annotation, including in challenging cases such as the pancreas, where it selected the better annotation in 116 of 124 comparisons (93.5% accuracy). This indicates its effectiveness in identifying higher-quality labels.
Figure 3: ROC analysis for pancreatic tumor annotation. Annotating tumors per voxel is time-consuming. Our ROC analysis strategy biases AI predictions toward high sensitivity. Inevitably, this generates more false positives, but removing them is much faster and easier (<5 sec/tumor) than creating annotations from scratch (4-5 min/tumor). False positives in non-tumor CT scans can be removed automatically using radiology reports, and false positives in tumor CT scans can be erased with a few clicks. We achieved 99% sensitivity for pancreatic tumor detection with only 0.6 false positives per scan, reducing annotation time by up to 92% compared with traditional methods.
Figure 4: Comparison of pancreatic and abdominal CT datasets. We compare PanTS-XL with public datasets along eight axes: number of CT scans, first-time public scans, annotated structures, contributing centers, contributing countries, availability of structured reports, availability of narrative reports, and total per-voxel annotations. Earlier pancreatic and abdominal datasets were already benchmarked in the PanTS study (Li et al., 2025); therefore, our comparison focuses on PanTS and PanTS-XL. A detailed comparison with publicly available datasets is provided in the supplementary section on related datasets.
Figure 5: Flagship Model matches human readers in tumor detection and surpasses them in tumor diagnosis. We compare Flagship Model with 13 human readers (6 junior, 5 senior, 2 expert) on pancreatic tumor detection and diagnosis. Each reader independently evaluated 50 patients (100 contrast-enhanced CT scans); Flagship Model was evaluated on a larger cohort of 982 patients (1,964 scans). Tumor detection. ROC curves (top left) show that Flagship Model achieves an AUC of 0.961, surpassing MedNeXt (Roy et al., 2023) (0.846) by 13.5% and ResEncL (Isensee et al., 2024) (0.810) by 15.1%, while matching the sensitivity–specificity performance of human readers. Tumor diagnosis. Confusion matrices (bottom right) show that Flagship Model attains 72% accuracy, outperforming junior (61%; +11%), senior (66%; +6%), and expert readers (69%; +3%) across PDAC, cyst, and PNET classification. Additional reader-study analyses are provided in the supplementary reader-study section.
...and 8 more figures
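
The sensitivity-biased operating point described in the Figure 3 caption can be sketched as a threshold choice on the ROC curve. This is a minimal illustration under stated assumptions, not the paper's procedure: the function names, the synthetic confidence scores, and the per-candidate (rather than per-scan) counting are all hypothetical. The idea is to pick the highest score threshold that still keeps sensitivity at or above a target (99% here) and then count the false positives that annotators would need to erase.

```python
import numpy as np

def threshold_for_sensitivity(scores, is_tumor, target_sensitivity=0.99):
    """Highest threshold whose sensitivity (score >= threshold) meets the target."""
    pos = np.sort(scores[is_tumor])                      # true-tumor scores, ascending
    # At most floor((1 - target) * n_pos) true tumors may fall below the threshold.
    n_missed = int(np.floor((1.0 - target_sensitivity) * pos.size))
    return pos[n_missed]

def sensitivity_and_false_positives(scores, is_tumor, threshold):
    flagged = scores >= threshold
    sensitivity = (flagged & is_tumor).sum() / is_tumor.sum()
    false_positives = int((flagged & ~is_tumor).sum())
    return sensitivity, false_positives

# Synthetic candidate detections (an assumption): 200 true tumors, 800 non-tumors.
rng = np.random.default_rng(0)
is_tumor = np.repeat([True, False], [200, 800])
scores = np.where(is_tumor, rng.beta(5, 2, 1000), rng.beta(2, 5, 1000))

thr = threshold_for_sensitivity(scores, is_tumor, target_sensitivity=0.99)
sens, n_fp = sensitivity_and_false_positives(scores, is_tumor, thr)
print(f"threshold={thr:.3f}, sensitivity={sens:.3f}, false positives={n_fp}")
```

Lowering the threshold this way trades extra false positives for fewer missed tumors, which matches the caption's argument that erasing a false positive is far cheaper than annotating a tumor from scratch.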
