Multimodal contrastive learning for spatial gene expression prediction using histology images

Wenwen Min; Zhiceng Shi; Jun Zhang; Jun Wan; Changmiao Wang

TL;DR

This work tackles the cost barrier of spatial transcriptomics by predicting spatial gene expression from readily available H&E histology images. It introduces mclSTExp, a multimodal framework that uses a Transformer-based spot encoder to capture spatial context and a contrastive learning module to fuse image features with spot features, producing a shared embedding space aligned by a CLIP-like objective. The method achieves superior prediction accuracy across three cancer datasets, enables interpretation of cancer- and immune-related genes, and supports spatial domain detection, highlighting its potential for scalable, clinically relevant spatial transcriptomics analysis. Overall, mclSTExp provides a cost-effective, accurate approach for inferring spatial gene expression from histology, with practical implications for cancer biology and pathology.

Abstract

In recent years, the advent of spatial transcriptomics (ST) technology has unlocked unprecedented opportunities for delving into the complexities of gene expression patterns within intricate biological systems. Despite its transformative potential, the prohibitive cost of ST technology remains a significant barrier to its widespread adoption in large-scale studies. An alternative, more cost-effective strategy involves employing artificial intelligence to predict gene expression levels using readily accessible whole-slide images (WSIs) stained with Hematoxylin and Eosin (H&E). However, existing methods have yet to fully capitalize on the multimodal information provided by H&E images and ST data with spatial location. In this paper, we propose mclSTExp, a multimodal contrastive learning framework with Transformer and DenseNet-121 encoders for Spatial Transcriptomics Expression prediction. We conceptualize each spot as a "word", integrating its intrinsic features with spatial context through the self-attention mechanism of a Transformer encoder. This integration is further enriched by incorporating image features via contrastive learning, thereby enhancing the predictive capability of our model. Our extensive evaluation of mclSTExp on two breast cancer datasets and a skin squamous cell carcinoma dataset demonstrates its superior performance in predicting spatial gene expression. Moreover, mclSTExp has shown promise in interpreting cancer-specific overexpressed genes, elucidating immune-related genes, and identifying specialized spatial domains annotated by pathologists. Our source code is available at https://github.com/shizhiceng/mclSTExp.
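
To make the contrastive fusion of image and spot features more concrete, the following is a minimal NumPy sketch of a symmetric CLIP-style loss over a batch of paired embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature of 0.07, and the use of hard one-to-one targets are all hypothetical choices, and the loss used by mclSTExp may differ in its details.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_style_loss(image_emb, spot_emb, temperature=0.07):
    """Symmetric contrastive loss for paired (image patch, spot) embeddings.

    Row i of image_emb and row i of spot_emb are assumed to describe the
    same spot. The temperature value is an assumption for illustration.
    """
    img = l2_normalize(image_emb)
    spot = l2_normalize(spot_emb)
    logits = img @ spot.T / temperature          # (batch, batch) similarity matrix
    diag = np.arange(logits.shape[0])            # matching pairs lie on the diagonal
    loss_image_to_spot = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_spot_to_image = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_image_to_spot + loss_spot_to_image)
```

Minimizing such a loss pulls each image patch toward its own spot embedding and pushes it away from the other spots in the batch, which is what yields a shared embedding space that can later be queried with unseen image patches.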

Paper Structure (13 sections, 16 equations, 5 figures, 1 table)

Introduction
Materials and Methods
Dataset description
Overview of mclSTExp
Image and Spot encoders
Contrastive learning module
Weight aggregation module
Results
mclSTExp can improve the prediction accuracy
Visualization of the predicted gene expression
Spatial region detection
Ablation studies
Discussion and Conclusion

Figures (5)

Figure 1: The architecture of the proposed mclSTExp model. Step 1: mclSTExp integrates spot features with their positional information using the self-attention mechanism of a Transformer. It then fuses H&E image information through contrastive learning, thereby learning a multimodal embedding space enriched with diverse features. Step 2: Image patches are projected into the learned multimodal embedding space to query the expressions of the nearest k spots; the gene expression of the test image is inferred by weighted aggregation of these queried spot expressions.
Figure 2: Evaluation of gene expression prediction on the HER2+ datasets by the PCCs between the observed and predicted gene expression by STnet (He et al., 2020), HisToGene (Pang et al., 2021), His2ST (Zeng et al., 2022), THItoGene (Jia et al., 2024), BLEEP (Xie et al., 2024) and mclSTExp.
Figure 3: Evaluation of gene expression prediction on the cSCC datasets by the PCCs between the observed and predicted gene expression by STnet (He et al., 2020), HisToGene (Pang et al., 2021), His2ST (Zeng et al., 2022), THItoGene (Jia et al., 2024), BLEEP (Xie et al., 2024) and mclSTExp.
Figure 4: Evaluation of gene expression prediction on the Alex+10x datasets by the PCCs between the observed and predicted gene expression by STnet (He et al., 2020), HisToGene (Pang et al., 2021), His2ST (Zeng et al., 2022), THItoGene (Jia et al., 2024), BLEEP (Xie et al., 2024) and mclSTExp.
Figure 5: Visualization of the top seven predicted genes in the HER2+ dataset, ranked by the highest average $-\log_{10}$ P-value across all tissue sections. The P-values are derived from the correlation between predicted and observed gene expression. For each of these seven genes, the tissue section on which our model achieves the smallest P-value is shown.
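
The inference step in the Figure 1 caption (query the nearest k spots, then aggregate their expressions with weights) and the PCC metric used in Figures 2 to 4 can be sketched in NumPy as follows. This is a hedged illustration only: the function names, the Euclidean distance, the inverse-distance weighting, and the default k are assumptions made for the example, and the weighting scheme actually used by mclSTExp may differ.

```python
import numpy as np


def predict_expression(query_emb, ref_emb, ref_expr, k=3, eps=1e-8):
    """Predict expression for query patches from their k nearest reference spots.

    query_emb: (Q, D) embeddings of test image patches.
    ref_emb:   (R, D) embeddings of reference (training) spots.
    ref_expr:  (R, G) gene expression of the reference spots.
    Inverse-distance weighting is an assumption for illustration.
    """
    # Pairwise Euclidean distances between queries and references: (Q, R).
    dists = np.linalg.norm(query_emb[:, None, :] - ref_emb[None, :, :], axis=2)
    nn_idx = np.argsort(dists, axis=1)[:, :k]                 # (Q, k) nearest spots
    nn_dist = np.take_along_axis(dists, nn_idx, axis=1)       # (Q, k) their distances
    weights = 1.0 / (nn_dist + eps)                           # closer spots weigh more
    weights /= weights.sum(axis=1, keepdims=True)             # normalize to sum to 1
    # Weighted average of the neighbours' expression profiles: (Q, G).
    return np.einsum("qk,qkg->qg", weights, ref_expr[nn_idx])


def per_gene_pcc(pred, obs):
    """Pearson correlation between predicted and observed expression, per gene.

    Both inputs have shape (spots, genes). Genes with zero variance in
    either input yield NaN.
    """
    pred_c = pred - pred.mean(axis=0)
    obs_c = obs - obs.mean(axis=0)
    num = (pred_c * obs_c).sum(axis=0)
    denom = np.sqrt((pred_c ** 2).sum(axis=0) * (obs_c ** 2).sum(axis=0))
    out = np.full(denom.shape, np.nan)
    np.divide(num, denom, out=out, where=denom > 0)
    return out
```

In this sketch, a query patch that coincides with a reference spot receives essentially that spot's expression, while a patch lying between several spots receives a blend of their profiles.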