Towards Spatial Transcriptomics-driven Pathology Foundation Models

Konstantin Hemker, Andrew H. Song, Cristina Almagro-Pérez, Guillaume Jaume, Sophia J. Wagner, Anurag Vaidya, Nikola Simidjievski, Mateja Jamnik, Faisal Mahmood

TL;DR

A general framework for ST-guided finetuning of pathology foundation models is proposed, showing that augmenting existing models with localized molecular supervision is an effective and practical step for improving visual representations and expanding their cross-modal utility.

Abstract

Spatial transcriptomics (ST) provides spatially resolved measurements of gene expression, enabling characterization of the molecular landscape of human tissue beyond histological assessment as well as localized readouts that can be aligned with morphology. Concurrently, the success of multimodal foundation models that integrate vision with complementary modalities suggests that morphomolecular coupling between local expression and morphology can be systematically used to improve histological representations themselves. We introduce Spatial Expression-Aligned Learning (SEAL), a vision-omics self-supervised learning framework that infuses localized molecular information into pathology vision encoders. Rather than training new encoders from scratch, SEAL is designed as a parameter-efficient vision-omics finetuning method that can be flexibly applied to widely used pathology foundation models. We instantiate SEAL by training on over 700,000 paired gene expression spot-tissue region examples spanning tumor and normal samples from 14 organs. Tested across 38 slide-level and 15 patch-level downstream tasks, SEAL provides a drop-in replacement for pathology foundation models that consistently improves performance over widely used vision-only and ST prediction baselines on slide-level molecular status, pathway activity, and treatment response prediction, as well as patch-level gene expression prediction tasks. Additionally, SEAL encoders exhibit robust domain generalization on out-of-distribution evaluations and enable new cross-modal capabilities such as gene-to-image retrieval. Our work proposes a general framework for ST-guided finetuning of pathology foundation models, showing that augmenting existing models with localized molecular supervision is an effective and practical step for improving visual representations and expanding their cross-modal utility.

Paper Structure (1 section, 9 equations, 9 figures, 54 tables)


Table of Contents

  1. Ethics Statement

Figures (9)

  • Figure 1: Overview of SEAL. (A) Despite progress in multimodal foundation models for pathology, self-supervised learning using fine-grained vision-omics pretraining remains unexplored. SEAL closes this gap by applying vision-omics finetuning to pretrained patch encoders, yielding vision embeddings that better encode local molecular information. (B) Distribution of histology image patches and corresponding spatial transcriptomics (ST) expression profiles in MAPLE-Train-720k and MAPLE-Test-70K, used for training and evaluating SEAL. MAPLE includes cancerous (42.8%), diseased (19.4%), healthy (32.2%), and treated (5.6%) samples. (C) Overview of the SEAL dual-pretraining strategy. First, an ST autoencoder (SEAL-omics) is trained on an auxiliary transcriptomics reconstruction task ($\mathcal{L}_{\text{REC, omics}}$). Second, the vision encoder (SEAL-vision), equipped with low-rank adapters, is trained to simultaneously align with the embedding space of SEAL-omics ($\mathcal{L}_{\text{INFO}}$) and reconstruct the original gene panel ($\mathcal{L}_{\text{REC, vision}}$). (D) SEAL-vision can be applied to a range of downstream tasks, such as molecular outcome prediction from whole-slide images and gene expression prediction. ST: Spatial Transcriptomics, HVG: Highly variable genes.
  • Figure 1: Slide-level morphological and IHC tasks & slide-level comparison with ST baselines. (A) Slide-level performance on two IHC prediction tasks and four morphological classification tasks: IHC ER6 and PR6 expression prediction (6 classes, measured using Cohen's weighted kappa), Ebrains diagnosis classification (30 classes, measured using balanced accuracy), DHMC Kidney subtyping (5 classes, measured using balanced accuracy), and BRACS subtyping as a fine-grained (7 classes, measured using balanced accuracy) and a coarse-grained (3 classes, measured using balanced accuracy) classification task. The chart shows the relative change of SEAL over the vision-only baseline for Uni-v2 and Virchow-v2. (B) Comparison of SEAL with the two best-performing patch-level fine-tuning methods (ST-Net, PathOmCLIP) on the full set of slide-level tasks: molecular status prediction, pathway expression, and marker & response prediction tasks. Performance is measured relative to the ST-Net baseline.
  • Figure 2: Overview of SEAL slide-level performance. (A) Evaluation of five foundation models with and without SEAL across 38 slide-level tasks covering molecular classification, pathway expression prediction, and marker and response prediction. The patch features from the baseline encoders and SEAL encoders are aggregated using attention-based multiple instance learning (ABMIL). (B) Slide-level performance comparison between Virchow-v2 finetuned with the SEAL or PathOmCLIP recipes, along with non-finetuned Virchow-v2 (Baseline), on the full set of tasks with ABMIL aggregation: molecular status prediction, pathway expression, and survival & treatment response prediction tasks. (C) t-SNE visualization comparing the robustness of Virchow-v2 and its SEAL counterpart to batch effects. Each slide was scanned with five different scanners, resulting in subtle differences in color, contrast, and resolution. Each dot represents a mean-pooled slide embedding; the same 100 slides were captured using five commonly used WSI scanners. Stronger clusters indicate that the encoder is more prone to batch effects. (D) ARI and MI scores across five baseline encoders and their SEAL counterparts, using the same setup as (C). Lower ARI and MI correspond to weaker cluster formation. ARI: Adjusted Rand Index, MI: Mutual Information.
  • Figure 2: Comparison between original and SEAL-finetuned encoders on molecular prediction tasks. (A) Attention-based MIL (ABMIL) performance for 18 slide-level molecular subtyping tasks. ABMIL was applied on patch embeddings extracted from five vision encoders (Conch, H0-mini, Phikon-v2, Uni-v2, and Virchow-v2) and their SEAL-finetuned variants. Binary tasks are evaluated using AUC and multi-class classification tasks using balanced accuracy. (B) Same tasks as (A), with ABMIL replaced by mean MIL.
  • Figure 3: Overview of SEAL patch-level task performance. (A) (top) Average patch-based ST prediction performance on the MAPLE-Test dataset (6 organs) across five foundation models for pathology (Conch, H0-mini, Phikon-v2, Virchow-v2, Uni-v2), evaluated using the Pearson correlation coefficient (PCC). The circle size indicates the relative number of trainable parameters in each model, from ViT-Base (86M parameters) to ViT-Huge (680M). (bottom) Organ-level predictive performance of five foundation models and their SEAL variants on the MAPLE-Test dataset (six organs). (B) Average performance of five pathology foundation models on HEST-Bench (9 histology-to-gene expression tasks) and corresponding absolute improvements obtained with SEAL relative to the original models. (C) Performance comparison between the Virchow-v2 SEAL encoder and other widely used histology-to-ST prediction baselines on MAPLE-Test. All baselines were also trained using the Virchow-v2 encoder, except OmiCLIP, which uses its corresponding encoder.
  • ...and 4 more figures
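
The second training stage described in the Figure 1 (C) caption combines an alignment term ($\mathcal{L}_{\text{INFO}}$) with a gene-panel reconstruction term ($\mathcal{L}_{\text{REC, vision}}$). The following is a minimal pure-Python sketch of how such a combined objective could look, not the authors' implementation. It assumes that the alignment term is a symmetric InfoNCE loss over paired patch and spot embeddings, that reconstruction is a mean squared error, and that the two terms are summed with a weight. The function names, `temperature`, and `rec_weight` are all hypothetical and do not come from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(vision_emb, omics_emb, temperature=0.1):
    """Symmetric InfoNCE over a batch of paired (patch, spot) embeddings.

    Assumption: pair i in vision_emb matches pair i in omics_emb, and all
    other items in the batch act as negatives.
    """
    v = [l2_normalize(x) for x in vision_emb]
    o = [l2_normalize(x) for x in omics_emb]
    n = len(v)
    # Cosine-similarity logits: row i is patch i, column j is spot j.
    logits = [
        [sum(a * b for a, b in zip(v[i], o[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(rows):
        # Cross-entropy where the correct class of row i is column i,
        # computed with the log-sum-exp trick for numerical stability.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))


def mse(pred, target):
    """Mean squared error over all entries of two equally shaped 2D lists."""
    count = sum(len(row) for row in target)
    squared = sum(
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    )
    return squared / count


def seal_vision_loss(vision_emb, omics_emb, pred_expr, true_expr,
                     rec_weight=1.0, temperature=0.1):
    """Hypothetical combined objective: alignment + weighted reconstruction."""
    alignment = info_nce(vision_emb, omics_emb, temperature)
    reconstruction = mse(pred_expr, true_expr)
    return alignment + rec_weight * reconstruction
```

In this sketch, correctly paired embeddings yield a near-zero alignment term, while mismatched pairs are penalized heavily; the reconstruction term is independent of the pairing and only measures how well the predicted gene panel matches the measured expression.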