Beyond Multiple Instance Learning: Full Resolution All-In-Memory End-To-End Pathology Slide Modeling
Gabriele Campanella, Eugene Fluder, Jennifer Zeng, Chad Vanderbilt, Thomas J. Fuchs
TL;DR
The work tackles the challenge of training on gigapixel pathology slides end-to-end by introducing an in-memory, multi-GPU framework that jointly optimizes a tile encoder and a slide aggregator at full resolution. It demonstrates gradient equivalence between single- and multi-GPU execution and validates the approach across diverse tasks, including EGFR mutation prediction in lung adenocarcinoma, whole-slide breast cancer detection, and fine-tuning a ViT-base pathology foundation model. A key finding is that increasing the number of tiles sampled per slide ($K$) reduces training loss and boosts validation performance, with end-to-end fine-tuning outperforming frozen-encoder baselines. While computationally intensive, the framework offers a flexible path to scalable foundation-model development in pathology, adaptable to various encoder/aggregator choices and slide-level supervision signals.
Abstract
Artificial Intelligence (AI) has great potential to improve health outcomes by training systems on vast digitized clinical datasets. Computational Pathology, with its massive amounts of microscopy image data and impact on diagnostics and biomarkers, is at the forefront of this development. Gigapixel pathology slides pose a unique challenge due to their enormous size and are usually divided into tens of thousands of smaller tiles for analysis. This tiling introduces a discontinuity in the machine learning process: tile-level encoders are trained separately from slide-level aggregators, and weakly supervised learning strategies become necessary. Training models end-to-end on entire pathology slides has been largely unexplored due to its computational challenges. To overcome this problem, we propose a novel approach to jointly train both a tile encoder and a slide aggregator fully in memory and end-to-end at high resolution, bridging the gap between input and slide-level supervision. While more computationally expensive, the approach shows promise in detailed quantitative validation for large-scale pre-training and fine-tuning of pathology foundation models.
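To make the idea of joint training and the gradient equivalence between single- and multi-GPU execution more concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a one-layer tanh tile encoder, a mean-pooling aggregator with a linear head, and a binary cross-entropy loss; the "devices" are simulated with plain Python lists, and all function names are hypothetical. It illustrates that when every tile embedding reaches a single slide-level aggregator, summing the per-device encoder gradients reproduces the single-device gradient.

```python
import math
import random

def encode(W, x):
    """Toy tile encoder (assumption): one linear layer followed by tanh."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]

def slide_logit(v, embeddings):
    """Toy aggregator (assumption): mean-pool tile embeddings, then a linear head."""
    k = len(embeddings)
    pooled = [sum(e[j] for e in embeddings) / k for j in range(len(v))]
    return sum(vj * pj for vj, pj in zip(v, pooled))

def encoder_grad(W, v, tiles, dlogit, k_total):
    """Gradient of the slide-level loss w.r.t. W contributed by `tiles` only."""
    grad = [[0.0] * len(W[0]) for _ in W]
    for x in tiles:
        h = encode(W, x)
        for j, hj in enumerate(h):
            # chain rule: loss -> logit -> mean pool -> tanh -> linear layer
            scale = dlogit * v[j] * (1.0 - hj * hj) / k_total
            for i, xi in enumerate(x):
                grad[j][i] += scale * xi
    return grad

def single_device_grad(W, v, tiles, y):
    """All K tiles of a slide processed in one place."""
    z = slide_logit(v, [encode(W, x) for x in tiles])
    dlogit = 1.0 / (1.0 + math.exp(-z)) - y  # derivative of BCE w.r.t. the logit
    return encoder_grad(W, v, tiles, dlogit, len(tiles))

def sharded_grad(W, v, tiles, y, n_devices):
    """Tiles split across simulated devices; the slide-level loss stays shared."""
    shards = [tiles[r::n_devices] for r in range(n_devices)]
    # simulated all-gather: every shard's embeddings feed one aggregator
    gathered = [encode(W, x) for s in shards for x in s]
    z = slide_logit(v, gathered)
    dlogit = 1.0 / (1.0 + math.exp(-z)) - y
    # each device backpropagates through its own tiles; a simulated all-reduce sums them
    partials = [encoder_grad(W, v, s, dlogit, len(tiles)) for s in shards]
    return [[sum(p[j][i] for p in partials) for i in range(len(W[0]))]
            for j in range(len(W))]

random.seed(0)
D, H, K = 4, 3, 8  # tile feature size, embedding size, tiles per slide
W = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(H)]
v = [random.uniform(-1, 1) for _ in range(H)]
tiles = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(K)]

g1 = single_device_grad(W, v, tiles, y=1.0)
g2 = sharded_grad(W, v, tiles, y=1.0, n_devices=4)
max_diff = max(abs(a - b) for ra, rb in zip(g1, g2) for a, b in zip(ra, rb))
print("max abs gradient difference:", max_diff)
```

In this sketch the two gradients agree up to floating-point rounding because the loss is computed once per slide from all tile embeddings, so the encoder gradient is a sum over tiles that can be partitioned freely. A real system would replace the list operations with collective communication primitives and a far larger encoder and aggregator.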
