Expectation-Maximization as the Engine of Scalable Medical Intelligence
Wenxuan Li, Pedro R. A. S. Bassi, Tianyu Lin, Yu-Cheng Chou, Jakob Wasserthal, Xinze Zhou, Qi Chen, Fabian Isensee, Yannick Kirchhoff, Maximilian Rokuss, Saikat Roy, Constantin Ulrich, Klaus Maier-Hein, Szymon Płotka, Xiaoxi Chen, Kang Wang, Yang Yang, Daguang Xu, Kai Ding, Yucheng Tang, Alan L. Yuille, Zongwei Zhou
TL;DR
ScaleMAI reframes the traditional EM algorithm for medical imaging by treating annotations as missing data and introducing an iterative loop in which a model critiques its own training data. The Expectation step uses automated tools (Label Verifier and Label Expert) to refine noisy labels, while the Maximization step retrains the model on a mix of unlabeled, synthetic, and selectively reviewed data, guided by ROC analysis to maximize sensitivity. The approach yields PanTS-XL, a 47,315-scan, 88-structure dataset with per-voxel annotations, and a Flagship Model that matches or surpasses human expert performance in tumor diagnosis and improves tumor detection and segmentation on multiple benchmarks. The framework significantly reduces expert workload and demonstrates scalable, data-driven improvement of medical AI across large-scale CT datasets and diverse imaging conditions.
Abstract
Large, high-quality, annotated datasets are the foundation of medical AI research, but constructing even a small, moderate-quality annotated dataset can take years of effort from multidisciplinary teams. Although active learning can prioritize what to annotate, scaling up still requires extensive manual effort to revise noisy annotations. We formulate this as a missing-data problem and develop ScaleMAI, a framework that unifies the co-evolution of data annotation and model development through an Expectation-Maximization (EM) process. In this iterative process, the AI model automatically identifies and corrects mistakes in the annotations (Expectation), while the refined annotated data are used to retrain the model and improve its accuracy (Maximization). Beyond the classical EM algorithm, ScaleMAI brings human experts into the loop to review the annotations (<5%) that neither the Expectation nor the Maximization step can adequately address. As a result, ScaleMAI progressively creates an annotated dataset of 47,315 CT scans (4.8x larger than the largest public dataset, PanTS) including 4,163,720 per-voxel annotations for benign/malignant tumors and 88 anatomical structures. ScaleMAI iteratively trains a model that exceeds human expert performance in tumor diagnosis (+7%) and outperforms models developed from smaller, moderate-quality datasets, with statistically significant gains in tumor detection (+10%) and segmentation (+14%) on two prestigious benchmarks.
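To make the alternating loop concrete, the following is a minimal toy sketch, not the ScaleMAI implementation. It assumes 1-D features, two classes, and a nearest-centroid model in place of a segmentation network; the names `fit_centroids`, `predict`, `em_label_refinement`, `margin_thresh`, and `review_budget` are hypothetical, and the `oracle` callback merely stands in for a human expert. Confident disagreements between model and label are auto-corrected, while ambiguous ones are routed to the expert under a 5% review budget.

```python
def fit_centroids(xs, labels):
    """Maximization-like step: refit a nearest-centroid model on current labels.

    Assumes both classes keep at least one member (true for the toy data below).
    """
    means = []
    for c in (0, 1):
        members = [x for x, y in zip(xs, labels) if y == c]
        means.append(sum(members) / len(members))
    return means


def predict(means, x):
    """Return (predicted class, margin); a larger margin means more confidence."""
    d0, d1 = abs(x - means[0]), abs(x - means[1])
    return (0 if d0 <= d1 else 1), abs(d0 - d1)


def em_label_refinement(xs, noisy_labels, oracle, n_iters=5,
                        margin_thresh=1.0, review_budget=0.05):
    """Alternate model refitting and label refinement (toy illustration)."""
    labels = list(noisy_labels)
    reviewed = set()                      # indices already checked by the expert
    max_reviews = int(review_budget * len(xs))
    for _ in range(n_iters):
        means = fit_centroids(xs, labels)           # retrain on current labels
        for i, x in enumerate(xs):                  # critique every label
            if i in reviewed:
                continue                            # expert labels are final
            pred, margin = predict(means, x)
            if pred == labels[i]:
                continue                            # model and label agree
            if margin >= margin_thresh:
                labels[i] = pred                    # confident: auto-correct
            elif len(reviewed) < max_reviews:
                labels[i] = oracle(i)               # ambiguous: ask the expert
                reviewed.add(i)
    return labels, reviewed


# Toy data: two well-separated clusters plus two points near the boundary.
xs = ([-1 + 2 * i / 49 for i in range(50)]
      + [5 + 2 * i / 49 for i in range(50)]
      + [2.8, 3.2])
truth = [0] * 50 + [1] * 50 + [0, 1]
# Corrupt every fifth label and both boundary points.
noisy = [1 - t if (i % 5 == 0 or i >= 100) else t for i, t in enumerate(truth)]

refined, reviewed = em_label_refinement(xs, noisy, oracle=lambda i: truth[i])
accuracy_before = sum(a == b for a, b in zip(noisy, truth)) / len(truth)
accuracy_after = sum(a == b for a, b in zip(refined, truth)) / len(truth)
print(accuracy_before, accuracy_after, sorted(reviewed))
```

In this toy setting only the two boundary points reach the oracle; every other corrupted label is repaired automatically, which mirrors the idea that expert review is reserved for the small fraction of cases the automated steps cannot resolve.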
