ST-Align: A Multimodal Foundation Model for Image-Gene Alignment in Spatial Transcriptomics

Yuxiang Lin, Ling Luo, Ying Chen, Xushi Zhang, Zihui Wang, Wenxian Yang, Mengsha Tong, Rongshan Yu

TL;DR

ST-Align is introduced as the first foundation model designed for ST that deeply aligns image-gene pairs by incorporating spatial context, bridging pathological imaging with genomic features. It highlights the potential for reducing the cost of ST and for providing insights into the distinction of critical compositions within human tissue.

Abstract

Spatial transcriptomics (ST) provides high-resolution pathological images and whole-transcriptomic expression profiles at individual spots across whole-slide scales. This makes it an ideal data source for developing multimodal foundation models. Although recent studies have attempted to fine-tune visual encoders together with trainable gene encoders at the spot level, the absence of a wider slide perspective and of intrinsic spatial relationships limits their ability to capture ST-specific insights effectively. Here, we introduce ST-Align, the first foundation model designed for ST that deeply aligns image-gene pairs by incorporating spatial context, effectively bridging pathological imaging with genomic features. We design a novel pretraining framework with a three-target alignment strategy for ST-Align, enabling (1) multi-scale alignment across image-gene pairs, capturing both spot- and niche-level contexts for a comprehensive perspective, and (2) cross-level alignment of multimodal insights, connecting localized cellular characteristics and broader tissue architecture. Additionally, ST-Align employs specialized encoders tailored to distinct ST contexts, followed by an Attention-Based Fusion Network (ABFN) for enhanced multimodal fusion, effectively merging domain-shared knowledge with ST-specific insights from both pathological and genomic data. We pre-trained ST-Align on 1.3 million spot-niche pairs and evaluated its performance on two downstream tasks across six datasets, demonstrating superior zero-shot and few-shot capabilities. ST-Align highlights the potential for reducing the cost of ST and for providing valuable insights into the distinction of critical compositions within human tissue.
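
The three alignment targets named in the abstract (spot-level image-gene, niche-level image-gene, and cross-level spot-niche) can be illustrated with a minimal sketch. The abstract does not specify the loss, so this assumes a CLIP-style symmetric contrastive objective with equal weights on the three terms; the function names, the temperature of 0.07, and the equal weighting are all hypothetical and are not taken from the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(a, b, temperature=0.07):
    """CLIP-style loss (assumed, not from the paper): row i of `a` pairs with row i of `b`."""
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature           # (batch, batch) similarity matrix
    idx = np.arange(len(a))                  # positives sit on the diagonal
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()  # a -> b direction
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()  # b -> a direction
    return 0.5 * (loss_ab + loss_ba)

def three_target_loss(spot_img, spot_gene, niche_img, niche_gene,
                      spot_fused, niche_fused):
    """Sum of the three alignment terms; equal weighting is an assumption."""
    return (symmetric_contrastive_loss(spot_img, spot_gene)          # spot-level
            + symmetric_contrastive_loss(niche_img, niche_gene)      # niche-level
            + symmetric_contrastive_loss(spot_fused, niche_fused))   # cross-level

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = [rng.normal(size=(8, 16)) for _ in range(6)]  # toy embeddings
    print(float(three_target_loss(*feats)))
```

In this sketch each term pulls matching rows together and pushes non-matching rows in the batch apart, so correctly paired embeddings give a loss near zero while mismatched pairs give a large one.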

Paper Structure

This paper contains 19 sections, 12 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Comparison between WSI-Bulk Transcriptomics and ST Data. ST enables the integration of high-resolution histopathological images with whole-transcriptomic gene expression profiles at the level of individual spots across the entire slide. In contrast, bulk transcriptomics averages gene expression across heterogeneous cell populations, lacking spatial resolution and the ability to correlate gene expression with specific regions or patches within WSIs.
  • Figure 2: Overview of ST-Align Architecture. (a) Paired WSI and GEP data are segmented into spot-level patches, which are then grouped into niche-level data. A compressed feature for each paired spot-level gene and niche-level image is encoded using a feature extractor pretrained on a large dataset, while spot-level images and niche-level genes are encoded using a trainable encoder. In addition, image and gene features are aligned at both the spot level and the niche level, and the spot-niche fusion features are also aligned. (b) The KNN algorithm is used to cluster spot-level data to obtain niche-level data. (c) Attention-based fusion network.
  • Figure 3: Zero-shot Spatial Clustering Results. Performance in identifying spatial domains was evaluated by comparing ST-Align with the existing methods CLIP and PLIP, using human annotation as the ground truth. Each row shows a distinct slice (151509 or 151673) derived from a different sample. Each color corresponds to a distinct spatial region, ranging from WM (White Matter) to L1.
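
The niche construction described in the Figure 2(b) caption, where KNN groups spot-level data into niche-level data, can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes neighbors are chosen by Euclidean distance between spot coordinates, and the function name `build_niches` and the default `k=6` are hypothetical.

```python
import numpy as np

def build_niches(coords, k=6):
    """Return an (n_spots, k + 1) index array: each row is a spot followed by
    its k nearest spots by Euclidean distance (an assumed niche definition)."""
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of spots")
    # Pairwise distances; O(n^2) memory, fine for a sketch but not for full slides.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)           # a spot is not its own neighbor
    neighbors = np.argsort(dist, axis=1, kind="stable")[:, :k]
    centers = np.arange(n)[:, None]          # the center spot leads each niche
    return np.concatenate([centers, neighbors], axis=1)

if __name__ == "__main__":
    # Four toy spots on a line; the last one is far from the others.
    spots = [[0, 0], [1, 0], [2, 0], [10, 0]]
    print(build_niches(spots, k=2))
```

Each row of the result indexes the spots that make up one niche, so the corresponding image patches or expression vectors can be gathered with ordinary array indexing before being passed to a niche-level encoder.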