Unsupervised Foundation Model-Agnostic Slide-Level Representation Learning

Tim Lenz, Peter Neidlinger, Marta Ligero, Georg Wölflein, Marko van Treeck, Jakob Nikolas Kather

TL;DR

COBRA addresses the challenge of learning robust slide-level representations from pathology WSIs without task-specific supervision. It does so with a single-modality contrastive SSL method that fuses tile embeddings from multiple foundation models (FMs) via a Mamba-2-based encoder and multi-head gated attention pooling, trained with an InfoNCE loss. The approach attains state-of-the-art results on CPTAC external validation with only 3048 TCGA WSIs used for pretraining and is compatible with unseen FMs at inference, while multi-magnification pretraining improves data efficiency. COBRA’s interpretability comes from tile-wise attention heatmaps that do not require Grad-CAM, and its FM-agnostic design supports flexible integration of future patch FMs, making slide-level representations more accessible and generalizable in clinical contexts.
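To make the pooling step concrete, the following is a minimal NumPy sketch of multi-head gated attention pooling over tile embeddings. It is not taken from the COBRA code base: the function name `gated_attention_pool`, the weight shapes, the random initialization, and the choice to average the heads are all assumptions made for illustration. The per-tile attention weights it returns are the kind of scores that could be rendered as a heatmap.

```python
import numpy as np

def gated_attention_pool(tiles, V, U, w):
    """Pool N tile embeddings into one slide embedding (illustrative sketch).

    tiles: (N, D) tile embeddings
    V, U:  (H, D, K) per-head projection weights (hypothetical shapes)
    w:     (H, K) per-head scoring vectors
    Returns the slide embedding (D,) and attention weights (H, N).
    """
    # tanh branch and sigmoid gate, computed for every head and tile.
    hidden = np.tanh(np.einsum("nd,hdk->hnk", tiles, V))
    gate = 1.0 / (1.0 + np.exp(-np.einsum("nd,hdk->hnk", tiles, U)))
    scores = np.einsum("hnk,hk->hn", hidden * gate, w)  # (H, N)

    # Numerically stable softmax over the tiles of each head.
    scores = scores - scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)

    per_head = attn @ tiles  # (H, D): attention-weighted tile average per head
    # Assumption: heads are averaged; concatenation would be another option.
    return per_head.mean(axis=0), attn

rng = np.random.default_rng(0)
N, D, K, H = 50, 32, 16, 4  # tiles, embedding dim, hidden dim, heads (arbitrary)
tiles = rng.normal(size=(N, D))
V = 0.1 * rng.normal(size=(H, D, K))
U = 0.1 * rng.normal(size=(H, D, K))
w = rng.normal(size=(H, K))
slide_embedding, attention = gated_attention_pool(tiles, V, U, w)
```

Here `attention[h]` sums to one over the tiles, so each head defines a distribution over tile positions.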

Abstract

Representation learning of pathology whole-slide images (WSIs) has primarily relied on weak supervision with Multiple Instance Learning (MIL). This approach leads to slide representations highly tailored to a specific clinical task. Self-supervised learning (SSL) has been successfully applied to train histopathology foundation models (FMs) for patch embedding generation. However, generating patient- or slide-level embeddings remains challenging. Existing approaches for slide representation learning extend the principles of SSL from patch-level learning to entire slides by aligning different augmentations of the slide or by utilizing multimodal data. By integrating tile embeddings from multiple FMs, we propose a new single-modality SSL method in feature space that generates useful slide representations. Our contrastive pretraining strategy, called COBRA, employs multiple FMs and an architecture based on Mamba-2. COBRA exceeds the performance of state-of-the-art slide encoders on four different public Clinical Proteomic Tumor Analysis Consortium (CPTAC) cohorts by at least +4.4% AUC on average, despite being pretrained on only 3048 WSIs from The Cancer Genome Atlas (TCGA). Additionally, COBRA is readily compatible at inference time with previously unseen feature extractors. Code is available at https://github.com/KatherLab/COBRA.
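The contrastive objective can be illustrated with a small NumPy sketch of an InfoNCE loss between two views of the same slides, for example slide embeddings derived from two different FMs. This is a generic formulation, not the authors' implementation: the function name `info_nce`, the symmetric averaging over both directions, and the temperature value are assumptions.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.2):
    """Symmetric InfoNCE loss between two batches of slide embeddings.

    z_a[i] and z_b[i] are two views of the same slide (the positive pair);
    every other pairing in the batch serves as a negative.
    The temperature default is an arbitrary illustrative value.
    """
    # L2-normalize so that dot products are cosine similarities.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature  # (B, B) similarity matrix

    def cross_entropy_diag(l):
        # Cross-entropy where the correct class for row i is column i.
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the a->b and b->a directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))  # 8 slides, 16-dim embeddings (arbitrary sizes)
aligned = info_nce(z, z + 0.01 * rng.normal(size=z.shape))
mismatched = info_nce(z, np.roll(z, 1, axis=0))
```

In this toy setup the loss is lower when the two views of each slide agree (`aligned`) than when the pairing is shifted by one slide (`mismatched`), which is the behavior the pretraining objective rewards.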

Paper Structure

This paper contains 53 sections, 11 equations, 10 figures, 23 tables.

Figures (10)

  • Figure 1: COBRA overview for self-supervised slide representation learning (A). A WSI is tessellated into patches at different magnifications (B) and encoded using different FMs (C) to produce tile embeddings. The magnifications (B) and FMs (C) serve as feature-space augmentations to pretrain the COBRA slide encoder (D) using contrastive self-supervised learning.
  • Figure 2: Few-shot linear probing classification. Linear probing macro-AUC performance comparison for $k$ samples per class.
  • Figure 3: COBRA unsupervised heatmap. Visualization of the weighting scores for the tiles of a WSI generated by COBRA for Patient-ID TCGA-CA-6716 from TCGA-CRC.
  • Figure 4: UMAP visualization of COBRA’s patient-level slide embeddings for TCGA and CPTAC datasets at $0.5$ mpp. Each color represents a different tissue type, with five tissue types in total.
  • Figure 5: COBRA unsupervised heatmap. Patient: TCGA-CA-6715
  • ...and 5 more figures