Multistain Pretraining for Slide Representation Learning in Pathology

Guillaume Jaume, Anurag Vaidya, Andrew Zhang, Andrew H. Song, Richard J. Chen, Sharifa Sahai, Dandan Mo, Emilio Madrigal, Long Phi Le, Faisal Mahmood

TL;DR

This work introduces Madeleine, a multimodal self-supervised pretraining framework for histopathology slides that treats immunohistochemical and special stains as distinct views of the same tissue. It combines a stain-agnostic patch encoder with a multi-head MIL slide encoder and a dual global-local cross-stain objective: a global InfoNCE alignment across stains and a local Graph Optimal Transport (GOT) alignment of patch embeddings, complemented by an optional intra-modal loss. Trained on large breast and kidney cohorts, Madeleine yields stain-agnostic slide representations that transfer effectively to diverse downstream tasks, including morphology, molecular status, prognosis, and IHC quantification, with strong few-shot and full-finetuning performance. The results demonstrate the value of multistain pretraining in computational pathology, offering interpretable attention insights and potential to scale to additional stains and modalities for broader clinical impact. Key contributions include (i) the Madeleine framework with a scalable, stain-agnostic slide encoder, (ii) large-scale pretraining on two organs with extensive stain diversity, and (iii) comprehensive evaluation across 21 tasks showing improved performance over MIL and intra-modal SSL baselines, including survival prediction and IHC quantification.
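The global cross-stain objective can be illustrated with a small sketch of a symmetric InfoNCE loss between H&E slide embeddings and the slide embeddings of one other stain from the same cases. This is a minimal pure-Python illustration, not the authors' implementation: the function names, the symmetric form, and the `temperature` value are assumptions made for the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def info_nce(he_embs, stain_embs, temperature=0.1):
    """Symmetric InfoNCE between paired slide embeddings (hypothetical helper).

    he_embs[i] and stain_embs[i] come from the same case (positive pair);
    every other pairing in the batch acts as a negative.
    The temperature value is an assumption, not taken from the paper.
    """
    he = [l2_normalize(v) for v in he_embs]
    st = [l2_normalize(v) for v in stain_embs]
    n = len(he)
    # logits[i][j]: scaled cosine similarity of H&E slide i and stained slide j
    logits = [
        [sum(a * b for a, b in zip(he[i], st[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # H&E -> stain direction: the correct match for row i is column i
    loss_he_to_stain = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # stain -> H&E direction: same computation on the transposed logits
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_stain_to_he = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_he_to_stain + loss_stain_to_he)


# Toy batch of two cases with 2-dimensional slide embeddings
he_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_batch = [[0.9, 0.1], [0.1, 0.9]]    # each stain embedding near its H&E pair
shuffled_batch = [[0.1, 0.9], [0.9, 0.1]]   # pairs deliberately mismatched

loss_aligned = info_nce(he_batch, aligned_batch)
loss_shuffled = info_nce(he_batch, shuffled_batch)
```

In this toy batch the correctly paired embeddings give a much lower loss than the mismatched ones, which is the behaviour the global alignment term rewards during pretraining.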

Abstract

Developing self-supervised learning (SSL) models that can learn universal and transferable representations of H&E gigapixel whole-slide images (WSIs) is becoming increasingly valuable in computational pathology. These models hold the potential to advance critical tasks such as few-shot classification, slide retrieval, and patient stratification. Existing approaches for slide representation learning extend the principles of SSL from small images (e.g., 224 x 224 patches) to entire slides, usually by aligning two different augmentations (or views) of the slide. Yet the resulting representation remains constrained by the limited clinical and biological diversity of the views. Instead, we postulate that slides stained with multiple markers, such as immunohistochemistry, can be used as different views to form a rich task-agnostic training signal. To this end, we introduce Madeleine, a multimodal pretraining strategy for slide representation learning. Madeleine is trained with a dual global-local cross-stain alignment objective on large cohorts of breast cancer samples (N=4,211 WSIs across five stains) and kidney transplant samples (N=12,070 WSIs across four stains). We demonstrate the quality of slide representations learned by Madeleine on various downstream evaluations, ranging from morphological and molecular classification to prognostic prediction, comprising 21 tasks using 7,299 WSIs from multiple medical centers. Code is available at https://github.com/mahmoodlab/MADELEINE.

Paper Structure (39 sections, 7 equations, 5 figures, 11 tables)


Figures (5)

  • Figure 1: Overview of $\textsc{Madeleine}$. a. Preprocessing: WSIs from various stains undergo tissue segmentation and patching into 256$\times$256-pixel tiles. Patch encoding: All patches are passed through a stain-agnostic Vision Transformer to extract patch embeddings augmented with a learnable stain-specific encoding. Slide encoding: Embeddings from each stain are sequentially passed through a pre-attention, a multi-head attention, and a post-attention module, resulting in stain-specific slide embeddings. b. $\textsc{Madeleine}$ is trained with a combination of global and local objectives. Global objective: Slide embeddings are aligned using a cross-modal contrastive objective (InfoNCE). Local objective: Patch embeddings are aligned using a cross-modal local Graph Optimal Transport objective. c. The resulting stain-agnostic slide encoder can be used for various downstream tasks in few-shot and full fine-tuning settings.
  • Figure 1: Additional heatmap examples obtained with $\textsc{Madeleine}$. A. Attention weights of the multi-headed (frozen) ABMIL slide encoder pretrained with $\textsc{Madeleine}$, overlaid on three randomly chosen samples from the TCGA Breast cohort. We show all heads and the average of heads. B. Attention weights of a single-head (frozen) ABMIL slide encoder pretrained with $\textsc{Madeleine}$, overlaid on three randomly chosen samples from the TCGA Breast cohort. Multi-headed ABMIL trained with $\textsc{Madeleine}$ can focus on different morphologies, whereas single-headed ABMIL focuses only on tumor morphology.
  • Figure 2: Few-shot performance of $\textsc{Madeleine}$ against baselines. All tasks are assessed on H&E-stained WSIs. Morphological subtyping is reported for $k$=10, molecular subtyping for $k$=25, and kidney transplant rejection for $k$=50. Each experiment is repeated ten times, each time sampling $k$ different samples per class. Except for HIPT and GigaSSL, all models use the same patch encoder. Each axis represents 10% AUC and each segment a 2% increment. Additional results for all $k$ values are provided in Supplementary 4 and 5.
  • Figure 3: Evaluation of $\textsc{Madeleine}$ and baselines on IHC quantification and survival prediction. a. We fine-tune $\textsc{Madeleine}$ for IHC quantification on the MGH cohort (N=962 ER and N=1,071 PR slides). 3-class and 6-class variants are derived from IHC scores extracted from pathology reports. Models are trained with $k$=25 examples per class. Random uses the $\textsc{Madeleine}$ architecture trained from scratch; FineTune is initialized with $\textsc{Madeleine}$ pretrained weights. We report the mean and standard deviation (std) on a 5-fold label-stratified train-test study. b. Survival prediction on TCGA Breast (N=1,041 slides). We report mean and std using a 5-fold site-stratified cross-validation. "SE" is $\textsc{Madeleine}$ with stain encodings. c. Molecular subtyping of $\textsc{Madeleine}$ fine-tuned on AIDPATH (N=48 for HER2 and N=50 for KI67) and BCNB (N=1,058). Evaluation using 5-fold cross-validation. MIL refers to the best of four MIL baselines.
  • Figure 4: $\textsc{Madeleine}$ attention weight visualization in a breast cancer case. Attention weights of the third (focusing on tumor, annotated in red) and fourth (focusing on non-tumor regions, annotated in green) heads of the $\textsc{Madeleine}$ slide encoder, along with high-attention patches per head.
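The multi-head attention pooling referred to in the captions above can be sketched as follows. This is a simplified pure-Python illustration of attention-based MIL (ABMIL) with several heads, not the paper's architecture: it omits the pre-attention and post-attention modules, uses a non-gated attention form, and concatenates head outputs, all of which are assumptions made for the example. Parameter shapes and names (`V`, `w`, `heads`) are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(patches, V, w):
    """One ABMIL head: a_i = softmax_i(w . tanh(V h_i)), pooled = sum_i a_i h_i."""
    scores = []
    for h in patches:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wj * zj for wj, zj in zip(w, hidden)))
    weights = softmax(scores)  # one attention weight per patch
    dim = len(patches[0])
    pooled = [sum(a * h[d] for a, h in zip(weights, patches)) for d in range(dim)]
    return pooled, weights


def multi_head_abmil(patches, heads):
    """Run every head on the same patch embeddings and concatenate the pooled
    vectors (concatenation is an assumption of this sketch)."""
    slide_embedding, all_weights = [], []
    for V, w in heads:
        pooled, weights = attention_head(patches, V, w)
        slide_embedding.extend(pooled)
        all_weights.append(weights)
    return slide_embedding, all_weights


# Toy slide: 5 patches with 4-dimensional embeddings, 2 randomly initialized heads
rng = random.Random(0)
dim, hidden_dim, n_heads, n_patches = 4, 3, 2, 5
patches = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_patches)]
heads = [
    (
        [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(hidden_dim)],  # V
        [rng.gauss(0, 1) for _ in range(hidden_dim)],                        # w
    )
    for _ in range(n_heads)
]
embedding, head_weights = multi_head_abmil(patches, heads)
```

Each entry of `head_weights` is a distribution over patches, so it can be overlaid on the slide as a per-head heatmap. Because every head has its own parameters, different heads are free to concentrate on different regions, in line with the per-head attention maps described in the captions.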