A self-supervised framework for learning whole slide representations

Xinhai Hou; Cheng Jiang; Akhil Kondepudi; Yiwei Lyu; Asadur Chowdury; Honglak Lee; Todd C. Hollon

TL;DR

This work tackles the challenge of learning transferable representations from gigapixel whole slide images without dense annotations. It introduces Slide Pre-trained Transformers (SPT), a two-stage framework that freezes a patch encoder and trains a WSI-level transformer using SSL with domain-informed two-view transformations of WSIs. The authors demonstrate that both self-supervised (ssSPT) and supervised (suSPT) variants outperform prior self-supervised and fully supervised MIL methods across five benchmarks, and that SPT improves performance across diverse patch encoders, including foundation models. Moreover, SPT yields interpretable self-attention maps on full WSIs, suggesting its potential as a foundation-model-style approach for computational pathology with broad practical impact.

Abstract

Whole slide imaging is fundamental to biomedical microscopy and computational pathology. Previously, learning representations for gigapixel-sized whole slide images (WSIs) has relied on multiple instance learning with weak labels, which do not annotate the diverse morphologic features and spatial heterogeneity of WSIs. A high-quality self-supervised learning method for WSIs would provide transferable visual representations for downstream computational pathology tasks, without the need for dense annotations. We present Slide Pre-trained Transformers (SPT) for gigapixel-scale self-supervision of WSIs. Treating WSI patches as tokens, SPT combines data transformation strategies from language and vision modeling into a general and unified framework to generate views of WSIs for self-supervised pretraining. SPT leverages the inherent regional heterogeneity, histologic feature variability, and information redundancy within WSIs to learn high-quality whole slide representations. We benchmark SPT visual representations on five diagnostic tasks across three biomedical microscopy datasets. SPT significantly outperforms baselines for histopathologic diagnosis, cancer subtyping, and genetic mutation prediction. Finally, we demonstrate that SPT consistently improves whole slide representations when using off-the-shelf, in-domain, and foundational patch encoders for whole slide multiple instance learning.

Paper Structure (49 sections, 3 equations, 11 figures, 13 tables, 1 algorithm)

Introduction
Related Work
Computational pathology
Multiple instance learning
Self-supervised representation learning
Methods
The SPT framework
SPT with supervision.
SPT transformation
Transformation strategy.
SPT Implementation
Experiments
Benchmarks
SRH CNS benchmark.
H&E glioma molecular classification benchmark.
...and 34 more sections

Figures (11)

Figure 1: Self-supervised whole slide learning. Previous work in computational pathology relies on multiple instance learning with weak supervision from slide or patient-level labels to learn whole slide representations \cite{zhu2017wsisa, ilse2018attention, lu2021data, shao2021transmil, chen2022scaling, javed2022additive, chen2024towards}. We present a self-supervised framework for learning whole slide representations, called Slide Pre-trained Transformers (SPT), by combining data transformations from vision and language modeling to generate high-quality paired views.
Figure 2: SPT overview. A. The SPT framework consists of a two-stage model architecture: 1) a pre-trained patch encoder $\mathcal{E}$ and 2) a transformer whole slide encoder $f$. WSIs are first divided into small patches, and the patch encoder extracts patch-level features. We then apply whole slide transformations to the patch tokens to create two views of the same WSI. The transformations combine splitting, cropping, and masking, which are informed by the structure and unique properties of WSIs. The transformed views are encoded by the transformer whole slide encoder, and the slide-level feature learning can use any paradigm. B. Example learning paradigms. In our experiments, we focus on three representative self-supervised paradigms, SimCLR \cite{chen2020simple}, BYOL \cite{grill2020bootstrap}, and VICReg \cite{bardes2021vicreg}, as well as supervised contrastive learning \cite{khosla2020supervised}.
Figure 3: Limited effect of pixel-level patch augmentations. We qualitatively evaluate the effect of pixel-level augmentation on the patch representations by visualizing a t-SNE plot of SimCLR pre-trained patch representations sampled from a single WSI. We observe that strong augmentations at the pixel level have a minimal effect on the patch embeddings. The invariant behavior of the patch encoder is explicitly enforced by the SimCLR pretext task.
Figure 4: SPT transformation strategy. SPT combines splitting, cropping, and masking to generate views; these transformations are motivated by the size, region diversity, and information redundancy of WSIs. Splitting partitions patches into mutually exclusive sets, which decreases mutual information between views; cropping can generate spatially diverse views covering different regions on the WSI; masking reduces redundant visual features and improves training efficiency. The combination of these transformations can create optimal positive pairs for whole slide representation learning.
Figure 5: SPT benchmarks with different patch encoders. ssSPT and suSPT offer performance boosts with a wide range of patch encoders. ssSPT approaches the supervised performance upper bound. Additional metrics with error bars are in Appendix \ref{app:ext:result}, Table \ref{tab:supp.results.patch_encoders}.
...and 6 more figures
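
The splitting, cropping, and masking transformations described in the Figure 2 and Figure 4 captions can be sketched on patch-token indices as follows. This is a minimal illustration, not the authors' implementation: the function name `spt_views`, the parameters `crop_frac` and `mask_ratio`, their default values, the rectangular crop window, and the order in which the three steps are applied are all assumptions made for this example.

```python
import numpy as np


def spt_views(coords, rng, crop_frac=0.6, mask_ratio=0.5):
    """Return two index arrays, one per view, into the patch tokens of one WSI.

    coords: (N, 2) integer array of patch grid coordinates, with N >= 2.
    crop_frac and mask_ratio are hypothetical knobs, not values from the paper.
    """
    coords = np.asarray(coords)
    n = len(coords)
    if n < 2:
        raise ValueError("need at least two patches to build two views")

    # Splitting: partition patches into two mutually exclusive sets.
    perm = rng.permutation(n)
    halves = (perm[: n // 2], perm[n // 2:])

    lo = coords.min(axis=0)
    span = coords.max(axis=0) - lo + 1
    size = np.maximum(1, np.ceil(span * crop_frac)).astype(int)

    views = []
    for idx in halves:
        # Cropping: keep only patches inside a randomly placed window
        # (a rectangular window is an assumption of this sketch).
        origin = lo + rng.integers(0, span - size + 1)
        pts = coords[idx]
        inside = np.all((pts >= origin) & (pts < origin + size), axis=1)
        cropped = idx[inside]
        if cropped.size == 0:
            cropped = idx  # fall back to the uncropped set if the window is empty

        # Masking: randomly drop a fraction of the remaining tokens.
        keep = max(1, int(round(cropped.size * (1.0 - mask_ratio))))
        views.append(rng.choice(cropped, size=keep, replace=False))
    return views
```

Each returned index array would select rows of the frozen patch encoder's feature matrix before the result is passed to the whole slide encoder. Because the two views are drawn from disjoint halves, they never share a patch, which is the property the Figure 4 caption attributes to splitting.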
