PLUTO: Pathology-Universal Transformer

Dinkar Juyal, Harshith Padigela, Chintan Shah, Daniel Shenker, Natalia Harguindeguy, Yi Liu, Blake Martin, Yibo Zhang, Michael Nercessian, Miles Markey, Isaac Finberg, Kelsey Luu, Daniel Borders, Syed Ashar Javed, Emma Krause, Raymond Biju, Aashish Sood, Allen Ma, Jackson Nyman, John Shamshoian, Guillaume Chhor, Darpan Sanghavi, Marc Thibault, Limin Yu, Fedaa Najdawi, Jennifer A. Hipp, Darren Fahy, Benjamin Glass, Eric Walk, John Abel, Harsha Pokkalla, Andrew H. Beck, Sean Grullon

TL;DR

PLUTO addresses the challenge of analyzing gigapixel pathology whole slide images (WSIs) by learning universal embeddings with a light-weight, multi-scale transformer backbone pre-trained on a large, diverse, multi-site dataset. It combines a FlexiViT-based architecture with a composite self-supervised objective (DINOv2/iBOT) augmented with MAE and Fourier losses to capture multi-frequency information across four magnifications. Task-specific adaptation heads then cover the hierarchy of pathology tasks: multiple instance learning (MIL) for slide-level prediction, tile classification for tissue-level tasks, and Mask R-CNN/Mask2Former for cellular and subcellular tasks. Across public and proprietary benchmarks, PLUTO matches or surpasses task-specific baselines and larger pathology foundation models while being easier to deploy and more robust to domain shift, as shown by strong in-distribution (ID) and out-of-distribution (OOD) performance on NSCLC, HER2, gland segmentation, and nuclei segmentation tasks. These results suggest that diverse pre-training data and a multi-scale architectural design can yield a practical, universal pathology embedding for scalable clinical and translational use, and they motivate further exploration of data diversity, architecture refinements, and deployment strategies in pathology foundation models.
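
To make the slide-level adaptation concrete, the following is a minimal sketch of attention-based MIL pooling, in which per-tile embeddings are combined into one slide embedding by softmax attention weights and then scored by a linear classifier. It is not PLUTO's actual head: the function names (`attention_mil_pool`, `slide_logit`), the linear attention scoring, the hand-set parameters, and the toy 2-dimensional embeddings are all assumptions made for illustration. In practice the attention and classifier parameters would be learned.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a sequence of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_mil_pool(
    tile_embeddings: Sequence[Sequence[float]],
    attn_vector: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Pool tile embeddings into one slide embedding.

    Each tile gets a scalar score (here a simple dot product with a
    hypothetical attention vector); softmax turns the scores into weights
    that sum to 1, and the slide embedding is the weighted average.
    """
    scores = [dot(e, attn_vector) for e in tile_embeddings]
    weights = softmax(scores)
    dim = len(tile_embeddings[0])
    slide = [
        sum(w * e[d] for w, e in zip(weights, tile_embeddings))
        for d in range(dim)
    ]
    return slide, weights


def slide_logit(
    slide_embedding: Sequence[float],
    clf_weights: Sequence[float],
    clf_bias: float = 0.0,
) -> float:
    """Linear classifier on top of the pooled slide embedding."""
    return dot(slide_embedding, clf_weights) + clf_bias


# Toy bag of three tile embeddings (stand-ins for frozen backbone outputs).
tiles = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn_vector = [2.0, 0.0]  # hypothetical, hand-set attention parameters
slide, weights = attention_mil_pool(tiles, attn_vector)
logit = slide_logit(slide, clf_weights=[1.0, -1.0])
```

The attention weights double as a per-tile relevance score, which is the kind of signal that can be rendered as a heatmap over the slide.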

Abstract

Pathology is the study of microscopic inspection of tissue, and a pathology diagnosis is often the medical gold standard to diagnose disease. Pathology images pose a unique challenge for computer-vision-based analysis: a single pathology Whole Slide Image (WSI) is gigapixel-sized and often contains hundreds of thousands to millions of objects of interest across multiple resolutions. In this work, we propose PathoLogy Universal TransfOrmer (PLUTO): a light-weight pathology foundation model (FM) that is pre-trained on a diverse dataset of 195 million image tiles collected from multiple sites and extracts meaningful representations across multiple WSI scales that enable a large variety of downstream pathology tasks. In particular, we design task-specific adaptation heads that utilize PLUTO's output embeddings for tasks that span pathology scales ranging from subcellular to slide-scale, including instance segmentation, tile classification, and slide-level prediction. We compare PLUTO's performance to other state-of-the-art methods on a diverse set of external and internal benchmarks covering multiple biologically relevant tasks, tissue types, resolutions, stains, and scanners. We find that PLUTO matches or outperforms existing task-specific baselines and pathology-specific foundation models, some of which use orders-of-magnitude larger datasets and model sizes when compared to PLUTO. Our findings present a path towards a universal embedding to power pathology image analysis, and motivate further exploration around pathology foundation models in terms of data diversity, architectural improvements, sample efficiency, and practical deployability in real-world applications.

Paper Structure (27 sections, 2 equations, 9 figures, 14 tables)

Figures (9)

  • Figure 1: Overview of PLUTO. Panel A) outlines the PLUTO multi-resolution adaptation pipeline. Tiles are extracted from WSIs at multiple resolutions and correspond to scales that capture different biological contexts. We organize pathology tasks according to these biological contexts as slide-level, tissue-level, and cellular & subcellular-level tasks, respectively. PLUTO generates task-agnostic embeddings that can be used in a variety of downstream tasks; adaptations to WSI-level prediction, tile classification, and instance segmentation are shown. Panel B) shows the PLUTO architecture in detail. WSI tiles at multiple resolutions are masked with varying patch sizes and passed to the backbone for self-supervised pre-training. The architecture is optimized for flexibility across multiple scales and patch sizes. In addition to DINO and iBOT losses, MAE and Fourier losses are applied across varying mask sizes to control the amount of low- and high-frequency information that is preserved.
  • Figure 2: Dataset characterization for the pre-training dataset. The distribution of the dataset by organ, disease, stain, scanner, and objective magnification is shown, as well as the distribution of cell point and tissue region annotations that augment the pre-training dataset (NOS: Not Otherwise Specified). Aggregate data characteristics are summarized above these distributions, including the number of biologically meaningful objects and region types, which we term substances (e.g. lymphocyte, blood vessel, Gleason pattern 3 prostate cancer, tumor bed). The large number of source sites (50+) provides substantial diversity during PLUTO self-supervised pre-training.
  • Figure 3: Instance segmentation adaptation with PLUTO. We show example task-specific inputs and outputs for the frozen PLUTO backbone with our segmentation adaptation on top: an adapter outputs maps at varying spatial and semantic resolutions, and a segmentation head then generates instance segmentation masks. The approach works across object scales, from nuclei (top two images) to glands (bottom), and across stain types. In our proprietary datasets, gland segmentation is trained at $768\times768$ pixels at $1$ mpp (microns per pixel), whereas nuclei segmentation is trained at $384\times384$ pixels at $0.25$ mpp. In our public benchmark experiments, we follow the prescribed task setup.
  • Figure 4: Comparison of AdditiveMIL heatmaps (right) from PLUTO against ground truth ROIs (left) for HER2 scores $3+$, $2+$, $1+$, $0$. There is considerable alignment between the ground truth ROIs and the PLUTO model's region-level predictions, indicating that the model is learning biologically-relevant features when making slide-level predictions.
  • Figure 5: Linear and Attention Probing performance of PLUTO on proprietary tile classification benchmarks at the substance level. For each model, the left-most bar highlights the macro-F1 score on the dataset. PLUTO outperforms a fully supervised CNN baseline. Attentive pooling of patch tokens provides more flexible adaptation and has the best performance across datasets. For the Oncology cell classification task, patch-token information from the central window is needed to capture the context for classifying the cell at the center pixel, so average pooling and attention pooling perform comparably. While patch embeddings do contain relevant information for the more complicated task of IBD tissue classification, performance improves significantly when attention pooling is applied on top of them.
  • ...and 4 more figures
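
The tile sizes and resolutions quoted in the Figure 3 caption translate directly into physical field of view, which helps explain why glands and nuclei are handled at different scales. The short sketch below works through that arithmetic, reading mpp as microns per pixel; the helper name `field_of_view_um` is hypothetical and introduced only for this illustration.

```python
def field_of_view_um(tile_px: int, mpp: float) -> float:
    """Side length of a square tile in microns: pixels times microns per pixel."""
    return tile_px * mpp


# Settings quoted in the Figure 3 caption for the proprietary datasets.
gland_fov = field_of_view_um(768, 1.0)     # gland segmentation tile
nuclei_fov = field_of_view_um(384, 0.25)   # nuclei segmentation tile

# Ratio of tissue side length and tissue area covered by the two tile types.
side_ratio = gland_fov / nuclei_fov
area_ratio = side_ratio ** 2

print(f"gland tile:  {gland_fov:.0f} um per side")
print(f"nuclei tile: {nuclei_fov:.0f} um per side")
print(f"gland tile covers {area_ratio:.0f}x the tissue area")
```

Under these settings a gland tile spans 768 microns per side and a nuclei tile 96 microns per side, so the gland tile sees 8 times the side length and 64 times the tissue area at a quarter of the sampling resolution.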