nnMIL: A generalizable multiple instance learning framework for computational pathology

Xiangde Luo, Jinxi Xiang, Yuanfeng Ji, Ruijiang Li

Abstract

Computational pathology holds substantial promise for improving diagnosis and guiding treatment decisions. Recent pathology foundation models enable the extraction of rich patch-level representations from large-scale whole-slide images (WSIs), but current approaches for aggregating these features into slide-level predictions remain constrained by design limitations that hinder generalizability and reliability. Here, we developed nnMIL, a simple yet broadly applicable multiple-instance learning framework that connects patch-level foundation models to robust slide-level clinical inference. nnMIL introduces random sampling at both the patch and feature levels, enabling large-batch optimization, task-aware sampling strategies, and efficient and scalable training across datasets and model architectures. A lightweight aggregator performs sliding-window inference to generate ensemble slide-level predictions and supports principled uncertainty estimation. Across 40,000 WSIs encompassing 35 clinical tasks and four pathology foundation models, nnMIL consistently outperformed existing MIL methods for disease diagnosis, histologic subtyping, molecular biomarker detection, and pan-cancer prognosis prediction. It further demonstrated strong cross-model generalization, reliable uncertainty quantification, and robust survival stratification in multiple external cohorts. In conclusion, nnMIL offers a practical and generalizable solution for translating pathology foundation models into clinically meaningful predictions, advancing the development and deployment of reliable AI systems in real-world settings.
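
To make the sampling idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a slide is stored as a list of N patch embeddings of dimension D, and it draws M patches and H embedding dimensions at random so that every bag has the same M x H shape and can be stacked into a large batch. The function names `sample_bag` and `build_batch`, the choice to sample with replacement when a slide has fewer than M patches, and the sorted dimension order are all illustrative assumptions.

```python
import random


def sample_bag(features, num_patches, num_dims, rng):
    """Return a fixed-size bag: num_patches patches, num_dims dimensions each.

    `features` is a list of N patch embeddings, each a list of D floats.
    Assumption: slides with fewer than num_patches patches are sampled
    with replacement so every bag has the same shape.
    """
    n, d = len(features), len(features[0])
    if n >= num_patches:
        patch_idx = rng.sample(range(n), num_patches)    # without replacement
    else:
        patch_idx = rng.choices(range(n), k=num_patches)  # with replacement
    # One set of dimensions is shared by all patches in the bag.
    dim_idx = sorted(rng.sample(range(d), min(num_dims, d)))
    return [[features[i][j] for j in dim_idx] for i in patch_idx]


def build_batch(slides, num_patches, num_dims, rng):
    """Sample one equally shaped bag per slide so the bags can be stacked."""
    return [sample_bag(s, num_patches, num_dims, rng) for s in slides]
```

Because every sampled bag has the same shape, slides with very different patch counts can share one mini-batch, which is the property that large-batch optimization relies on in this sketch.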

Paper Structure

This paper contains 12 equations, 13 figures, 17 tables.

Figures (13)

  • Figure 1: Overview of nnMIL framework and evaluation. nnMIL is a generalizable multiple instance learning framework that shows superior performance across 35 slide-level computational pathology tasks. (a), The motivation of nnMIL is to enable MIL training in a simple yet generalizable manner. We realize this idea by leveraging large-batch optimization to enhance both robustness and effectiveness of model training (goyal2017accurate; mandt2017stochastic). Specifically, we introduce stochastic sampling at both the patch level (randomly sampling M out of N patches) and the feature level (randomly sampling H out of D embedding dimensions), combined with task-aware batch construction tailored for MIL optimization. (b), A rule-based parameterization strategy, determined according to dataset characteristics, is further employed to simplify and streamline both training and inference. (c), The simplified attention-based aggregator for slide-level prediction and uncertainty estimation. (d), Overview of the evaluation datasets encompassing 35 slide-level clinical tasks across three major categories: disease classification and subtyping (8 tasks), molecular biomarker detection (12 tasks), and prognosis prediction (15 tasks). (e), Overall ranking scores for MIL methods across all 35 slide-level clinical tasks and pathology foundation models (mean $\pm$ standard deviation). (f), Reliability analysis of model predictions based on uncertainty scores (Unc) in the left panel shows that removing cases with the highest uncertainty scores could improve model performance on disease diagnosis (BCCC 3 cls) in the right panel. (g), Comparison among conventional ABMIL trained with a batch size of 1, ABMIL with nnMIL training strategies (larger batch size with patch-level sampling and task-aware sampling), and the complete nnMIL framework across four pathology foundation models (Virchow2, GigaPath, UNI, and H0). These results demonstrate that nnMIL improves performance and generalizability regardless of the pathology foundation model used for feature extraction. WSI: whole-slide image; PathFMs: pathology foundation models; Cls: classification; Reg: regression; P: prediction; Unc: uncertainty; BACC: balanced accuracy.
  • Figure 1: Detailed ranking scores across disease diagnosis and subtyping, molecular biomarker detection, pan-cancer prognosis prediction, and generalizability of prognostic models in four external cohorts. All results are reported as mean $\pm$ standard deviation.
  • Figure 2: Disease classification and subtyping. (a), Comparisons of MIL methods for performance on 8 disease classification and subtyping tasks across four pathology foundation models (GigaPath, H0, UNI and Virchow2). nnMIL was compared with six widely used MIL methods including DTFD, DSMIL, ILRA, TransMIL, WIKG and ABMIL. (b)-(i), Comparison across eight individual tasks using UNI as the feature extractor, including EBRAINS (30 cls Fine and 12 cls Coarse tasks, where cls means classes), PANDA (7 cls), IMP-CRC2024 (3 cls), BCCC (2-, 3-, and 5-cls tasks), and BRACS (7 cls). Performance was evaluated using balanced accuracy (BACC), except for PANDA, which was assessed with Cohen’s Kappa. (j)-(m), Performance analysis after excluding cases with the highest uncertainty scores. Statistical significance was determined using a two-sided Wilcoxon signed-rank test, where * indicates $P$ < 0.05, ** indicates $P$ < 0.01, and *** indicates $P$ < 0.001. All results are reported as mean $\pm$ standard deviation, estimated from 1,000 bootstrap replicates.
  • Figure 2: The relationship between prediction score and estimated slide-level uncertainty. (a), Results for disease subtyping tasks, where uncertainty distinctly separates correct from incorrect predictions across EBRAINS, BCCC, PANDA, and IMP-CRC2024. (b), Results for molecular biomarker detection tasks, including ER, BRAF, IDH, and TMB, where the same relationship between uncertainty and prediction accuracy is preserved, underscoring the model’s consistent calibration properties. Purple open circles indicate correctly classified cases, whereas orange crosses denote misclassified samples.
  • Figure 3: Molecular biomarker detection. (a), Average performance of nnMIL and comparative methods across 12 datasets representing 9 molecular biomarkers evaluated on four pathology foundation models (GigaPath, H0, UNI, and Virchow2). nnMIL was benchmarked against six widely used MIL methods, including DTFD, DSMIL, ILRA, TransMIL, WIKG, and ABMIL. (b)-(i), Comparisons across 12 individual datasets using Virchow2 as the feature extractor, covering three breast cancer biomarkers (ER, HER2, and PR) from the BCNB cohort; two colorectal cancer biomarkers (BRAF and KRAS) from the TCGA-CRC and MCO cohorts; one brain tumor biomarker (IDH) from the TCGA-LGG and TCGA-HGG cohorts; and three pan-cancer genomic biomarkers (WGD, TMB, and Aneuploidy) from TCGA. Performance was evaluated using the area under the curve (AUC), except for Aneuploidy, which was a regression task assessed using the Pearson correlation coefficient. (j)-(m), Performance analysis after excluding samples with the highest uncertainty scores from nnMIL. Statistical significance was determined using a two-sided Wilcoxon signed-rank test, where * indicates $P$ < 0.05, ** indicates $P$ < 0.01, and *** indicates $P$ < 0.001. All results are reported as mean $\pm$ standard deviation, estimated from 1,000 bootstrap replicates.
  • ...and 8 more figures
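
Several captions above report balanced accuracy (BACC) as mean $\pm$ standard deviation over 1,000 bootstrap replicates and analyze performance after excluding the cases with the highest uncertainty scores. The sketch below shows one plausible way to compute both quantities. It is an illustration under stated assumptions, not the paper's evaluation code: the helper names, the `drop_frac` parameter, and the case-level resampling scheme are hypothetical, and the uncertainty scores are treated as given inputs.

```python
import random
import statistics


def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall over the classes present in y_true."""
    recalls = []
    for c in set(y_true):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def drop_most_uncertain(y_true, y_pred, unc, drop_frac):
    """Keep the (1 - drop_frac) fraction of cases with the lowest uncertainty.

    `drop_frac` is a hypothetical parameter chosen for illustration.
    """
    order = sorted(range(len(unc)), key=lambda i: unc[i])
    keep = order[:max(1, round(len(order) * (1 - drop_frac)))]
    return [y_true[i] for i in keep], [y_pred[i] for i in keep]


def bootstrap_bacc(y_true, y_pred, n_boot=1000, seed=0):
    """Mean and standard deviation of BACC over resampled case sets."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        scores.append(
            balanced_accuracy([y_true[i] for i in idx], [y_pred[i] for i in idx])
        )
    return statistics.mean(scores), statistics.stdev(scores)
```

Calling `drop_most_uncertain` before `bootstrap_bacc` gives the filtered-cohort estimate; comparing it with the unfiltered estimate shows whether the uncertainty score tracks prediction errors.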