Fusing Pixels and Genes: Spatially-Aware Learning in Computational Pathology

Minghao Han; Dingkang Yang; Linhao Qu; Zizhi Chen; Gang Li; Han Wang; Jiacong Wang; Lihua Zhang

TL;DR

This paper proposes STAMP, a Spatial Transcriptomics-Augmented Multimodal Pathology representation learning framework that integrates spatially-resolved gene expression profiles to enable molecule-guided joint embedding of pathology images and transcriptomic data, and validates it across six datasets and four downstream tasks.

Abstract

Recent years have witnessed remarkable progress in multimodal learning within computational pathology. Existing models primarily rely on vision and language modalities; however, language alone lacks molecular specificity and offers limited pathological supervision, leading to representational bottlenecks. In this paper, we propose STAMP, a Spatial Transcriptomics-Augmented Multimodal Pathology representation learning framework that integrates spatially-resolved gene expression profiles to enable molecule-guided joint embedding of pathology images and transcriptomic data. Our study shows that self-supervised, gene-guided training provides a robust and task-agnostic signal for learning pathology image representations. Incorporating spatial context and multi-scale information further enhances model performance and generalizability. To support this, we constructed SpaVis-6M, the largest Visium-based spatial transcriptomics dataset to date, and trained a spatially-aware gene encoder on this resource. Leveraging hierarchical multi-scale contrastive alignment and cross-scale patch localization mechanisms, STAMP effectively aligns spatial transcriptomics with pathology images, capturing spatial structure and molecular variation. We validate STAMP across six datasets and four downstream tasks, where it consistently achieves strong performance. These results highlight the value and necessity of integrating spatially resolved molecular supervision for advancing multimodal learning in computational pathology. The code is included in the supplementary materials. The pretrained weights and SpaVis-6M are available at: https://github.com/Hanminghao/STAMP.
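
The abstract describes contrastive alignment between pathology image embeddings and gene expression embeddings but does not give the loss itself. As a rough illustration only, the sketch below shows a generic single-scale, CLIP-style symmetric InfoNCE loss over paired image and gene embeddings. It is an assumption about the general family of objectives, not STAMP's hierarchical multi-scale formulation; the function names, the temperature value, and the toy vectors are all hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the target for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(image_embs, gene_embs, temperature=0.07):
    """Generic symmetric InfoNCE over paired embeddings.

    image_embs[i] and gene_embs[i] are assumed to come from the same spot.
    The temperature is a hypothetical default, not a value from the paper.
    """
    img = [l2_normalize(v) for v in image_embs]
    gene = [l2_normalize(v) for v in gene_embs]
    # Cosine similarity between every image and every gene embedding.
    logits = [
        [sum(a * b for a, b in zip(i_vec, g_vec)) / temperature for g_vec in gene]
        for i_vec in img
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the image-to-gene and gene-to-image directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy check: correctly paired embeddings should give a lower loss
# than the same embeddings with the pairing swapped.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = symmetric_info_nce(images, [[0.9, 0.1], [0.1, 0.9]])
shuffled = symmetric_info_nce(images, [[0.1, 0.9], [0.9, 0.1]])
```

In this toy setting the loss is close to zero when each image embedding points the same way as its paired gene embedding, and it grows when the pairing is swapped, which is the behaviour a contrastive alignment objective relies on.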

Paper Structure (39 sections, 11 equations, 10 figures, 7 tables, 2 algorithms)

Introduction
Related Work
Methodology
SpaVis-6M: The largest Visium-based Spatial Transcriptomics Dataset
Spatial-aware Gene Encoder Pretraining
Hierarchical Multi-scale Contrastive Alignment
Experiments and Results
Experimental Settings
Experiments on Linear Probing and Unsupervised Clustering
Experiments on Gene Expression Prediction
Experiments on WSI Classification
Ablation Study
Conclusion and Discussion
Ethics Statement
Reproducibility Statement
...and 24 more sections

Figures (10)

Figure 1: Gene-guided supervision boosts vision encoders. We evaluated unsupervised clustering on the DLPFC dataset. Models fine-tuned with spatial transcriptomics supervision and STAMP consistently outperform baselines across four metrics (ARI, NMI, Silhouette, and Calinski-Harabasz), demonstrating that molecular information enhances biological structure identification.
Figure 2: Overview of STAMP's Two-Stage Pretraining Framework. The framework is divided into two stages: (a) Spatial-aware Gene Encoder Pretraining uses 5.75 million spatial transcriptomics gene expression profiles to pretrain the gene encoder. (b) Hierarchical Multi-scale Contrastive Alignment adopts a pretrained pathological vision transformer as the vision encoder and aligns it with the gene encoder via hierarchical contrastive learning to fuse the two modalities.
Figure 3: Results of MIL-based WSI Classification. Comparison of STAMP and baselines for WSI-level gene mutation state classification using ABMIL on the LUAD-mutation dataset.
Figure 4: Visium-integrated Spatial Transcriptomics Dataset (SpaVis-6M): Comprehensive Overview.
Figure 5: Visualization of Linear Probing, t-SNE, and Unsupervised Clustering. Results for STAMP, scGPT-Spatial, and Hoptimus0 on different samples: (a) linear probing on sample 151673; (b) t-SNE visualization on sample 151676; (c) unsupervised clustering on sample 151675.
...and 5 more figures
