Beyond Multiple Instance Learning: Full Resolution All-In-Memory End-To-End Pathology Slide Modeling

Gabriele Campanella, Eugene Fluder, Jennifer Zeng, Chad Vanderbilt, Thomas J. Fuchs

TL;DR

The work tackles the challenge of training on gigapixel pathology slides end-to-end by introducing an in-memory, multi-GPU framework that jointly optimizes a tile encoder and a slide aggregator at full resolution. It demonstrates gradient equivalence between single- and multi-GPU execution and validates the approach across diverse tasks, including EGFR mutation prediction in lung adenocarcinoma, whole-slide breast cancer detection, and fine-tuning a ViT-base pathology foundation model. A key finding is that increasing the number of tiles sampled per slide ($K$) reduces training loss and boosts validation performance, with end-to-end fine-tuning outperforming frozen-encoder baselines. While computationally intensive, the framework offers a flexible path to scalable foundation-model development in pathology, adaptable to various encoder/aggregator choices and slide-level supervision signals.

Abstract

Artificial Intelligence (AI) has great potential to improve health outcomes by training systems on vast digitized clinical datasets. Computational Pathology, with its massive amounts of microscopy image data and impact on diagnostics and biomarkers, is at the forefront of this development. Gigapixel pathology slides pose a unique challenge due to their enormous size and are usually divided into tens of thousands of smaller tiles for analysis. This results in a discontinuity in the machine learning process by separating the training of tile-level encoders from slide-level aggregators and the need to adopt weakly supervised learning strategies. Training models from entire pathology slides end-to-end has been largely unexplored due to its computational challenges. To overcome this problem, we propose a novel approach to jointly train both a tile encoder and a slide-aggregator fully in memory and end-to-end at high-resolution, bridging the gap between input and slide-level supervision. While more computationally expensive, detailed quantitative validation shows promise for large-scale pre-training and fine-tuning of pathology foundation models.

Paper Structure (10 sections, 6 figures, 1 table)


Figures (6)

  • Figure 1: a) H100 GPU memory usage for training ResNet encoders of different sizes. b) H100 GPU memory usage for training ViT encoders of different sizes. Full precision and AMP training are compared. c) Distribution of 20x magnification non-overlapping tissue tiles per slide in a large health system-level dataset.
  • Figure 2: Method overview. A distributed sampler generates appropriate batches of tissue tiles for each encoder rank in the DDP group. Features generated by the encoder ranks are gathered and concatenated in rank 0. Aggregator forward and backward passes are executed in rank 0. The gradients of the input features to the aggregator are split and scattered to the appropriate rank. In each rank a pseudo-loss is generated to continue backpropagation.
  • Figure 3: Gradient equivalence in real-world networks. We tracked the parameters and gradients of three layers in the network: two convolutional layers from the encoder and the classification layer of the aggregator. Left) Normalized L1 norm of the difference between model parameters in single and multi-GPU experiments. Middle) Absolute difference in the loss between single and multi-GPU experiments. Right) Normalized L1 norm of the difference between model parameters’ gradients in single and multi-GPU experiments.
  • Figure 4: EGFR Mutation Prediction in LUAD. Each data point summarizes the 20 MCCV runs. a) Training loss convergence curves stratified by $K$ tiles per slide and GPU parallelization strategy. The shaded area is the bootstrapped 95% confidence interval (CI). b) Final training loss as a function of $K$ tiles per slide, stratified by GPU parallelization strategy. Error bars are bootstrapped 95% CIs. c) Validation AUC as a function of $K$ tiles per slide, stratified by GPU parallelization strategy. Error bars are bootstrapped 95% CIs of the mean. d) Comparison of validation AUCs between GPU parallelization strategies.
  • Figure 5: Breast Cancer Detection Experiment. a) Training loss convergence. b) Validation AUC convergence. Bars represent the bootstrapped 95% confidence interval. c) ROC curve for the best validation result. The shaded region represents the bootstrapped 95% confidence interval.
  • ...and 1 more figures
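
The pseudo-loss step described in the Figure 2 caption can be illustrated with a small single-process sketch. This is not the authors' implementation. It assumes a toy scalar encoder, a toy log-sum-exp aggregator, and finite differences in place of autograd, and it simulates the encoder ranks with list slices. All function names (`encode`, `aggregate_loss`, `pseudo_loss_grad`, and so on) are hypothetical. The idea it checks is that if each rank backpropagates a pseudo-loss equal to the sum of its features times the feature gradients it received from the aggregator rank, the encoder gradients summed over ranks should match the single-device end-to-end gradient.

```python
import math

def encode(w, x):
    """Toy tile encoder with one shared parameter w: one scalar feature per tile."""
    return math.tanh(w * x)

def aggregate_loss(features, v, label):
    """Toy slide aggregator: log-sum-exp pooling, logistic head, binary cross-entropy."""
    pooled = math.log(sum(math.exp(h) for h in features) / len(features))
    p = 1.0 / (1.0 + math.exp(-v * pooled))
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def central_diff(f, t, eps=1e-6):
    """Numerical derivative of a scalar function f at t (stand-in for autograd)."""
    return (f(t + eps) - f(t - eps)) / (2.0 * eps)

def end_to_end_grad(w, v, tiles, label):
    """Reference: dL/dw computed on a single device through encoder and aggregator."""
    return central_diff(
        lambda w_: aggregate_loss([encode(w_, x) for x in tiles], v, label), w
    )

def feature_grads(features, v, label):
    """dL/dh_i for every gathered feature, as computed on the aggregator rank."""
    grads = []
    for i in range(len(features)):
        def loss_of_hi(h, i=i):
            return aggregate_loss(features[:i] + [h] + features[i + 1:], v, label)
        grads.append(central_diff(loss_of_hi, features[i]))
    return grads

def pseudo_loss_grad(w, v, tiles, label, n_ranks):
    """dL/dw obtained by splitting tiles over simulated ranks and using pseudo-losses."""
    size = len(tiles) // n_ranks  # assumes len(tiles) is divisible by n_ranks
    shards = [tiles[r * size:(r + 1) * size] for r in range(n_ranks)]
    # Each rank encodes its own shard; rank 0 gathers and concatenates the features.
    gathered = [encode(w, x) for shard in shards for x in shard]
    # Rank 0 runs the aggregator and gets the gradient of the loss w.r.t. its inputs.
    grads = feature_grads(gathered, v, label)
    total = 0.0
    for r, shard in enumerate(shards):
        g = grads[r * size:(r + 1) * size]  # the slice scattered back to rank r
        # Pseudo-loss: features times their upstream gradients, held constant.
        def pseudo(w_, shard=shard, g=g):
            return sum(encode(w_, x) * gi for x, gi in zip(shard, g))
        total += central_diff(pseudo, w)  # rank-local backward pass
    return total  # the sum over ranks stands in for gradient synchronization

tiles = [0.2, -1.0, 0.5, 1.5, -0.3, 0.9]
ref = end_to_end_grad(0.7, 1.3, tiles, label=1)
dist = pseudo_loss_grad(0.7, 1.3, tiles, label=1, n_ranks=3)
print(abs(ref - dist) < 1e-6)
```

In an actual multi-GPU setup, the list slicing would presumably be replaced by gather and scatter collectives, the finite differences by automatic differentiation, and the final sum by the framework's gradient reduction, whose scaling convention would need to be accounted for.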