A Multimodal Foundation Model of Spatial Transcriptomics and Histology for Biological Discovery and Clinical Prediction

Jinxi Xiang; Siyu Hou; Yuchen Li; Ryan Quinton; Xiaoming Zhang; Feyisope Eweje; Xiangde Luo; Yijiang Chen; Zhe Li; Colin Bergstrom; Ted Kim; Sierra Willens; Francesca Maria Olguin; Matthew Abikenari; Andrew Heider; Sanjeeth Rajaram; Joel Neal; Maximilian Diehn; Xiang Zhou; Ruijiang Li

Abstract

Spatial transcriptomics (ST) enables gene expression mapping within anatomical context but remains costly and low-throughput. Hematoxylin and eosin (H\&E) staining offers rich morphology yet lacks molecular resolution. We present \textbf{STORM} (\textbf{S}patial \textbf{T}ranscriptomics and hist\textbf{O}logy \textbf{R}epresentation \textbf{M}odel), a foundation model trained on 1.2 million spatially resolved transcriptomic profiles with matched histology across 18 organs. Using a hierarchical architecture integrating morphological features, gene expression, and spatial context, STORM bridges imaging and omics through robust molecular--morphological representations. STORM enhances spatial domain discovery, producing biologically coherent tissue maps, and outperforms existing methods in predicting spatial gene expression from H\&E images across 11 tumor types. The model is platform-agnostic, performing consistently across Visium, Xenium, Visium HD, and CosMx. Applied to 23 independent cohorts comprising 7,245 patients, STORM significantly improves immunotherapy response prediction and prognostication over established biomarkers, providing a scalable framework for spatially informed discovery and clinical precision medicine.

Paper Structure

This paper contains 1 section, 4 equations, 10 figures.

Figures (10)

Figure 1: Study overview of the proposed ST-H&E foundation model, STORM. (A) We trained STORM on a large-scale human pan-tissue dataset comprising 632 paired ST–H&E slides across four major spatial transcriptomics platforms and 18 organs, totaling $\sim$1.2 million high-resolution tissue spots. (B) Model architecture and pretraining strategy. STORM employs a hierarchical design: first, spot-level foundation models process each spot independently; second, the image and ST features for each spot, together with the spatial coordinates, are integrated by a spatial encoder, which both fuses H&E–ST information and preserves spatial context. STORM is fully compatible with heterogeneous ST platforms, accommodating variations in spatial resolution and gene panels. (C) After pretraining, STORM can be applied to various tasks with minimal or no further fine-tuning. First, on comprehensive benchmarks for spatial domain discovery with H&E–ST inputs, STORM substantially outperforms OmiCLIP \cite{chen2025visual} and other ST foundation models. Second, for virtual ST prediction from H&E images across 11 organ types, STORM surpasses other H&E–ST models and H&E-only foundation models. Finally, in large-scale real-world patient cohorts, we demonstrate the translational potential of STORM for clinical outcome prediction by integrating H&E and virtual ST, thereby eliminating the need for costly ST. STORM markedly improves predictive accuracy across 23 independent cohorts with 7,245 patients, covering pan-cancer prognosis prediction and immunotherapy response prediction in melanoma, bladder (advanced and localized), endometrial, gastroesophageal, lung, and kidney cancers.
Figure 2: Multimodal H&E-ST spatial domain discovery with STORM. (A) Evaluation setup using a Visium ST dataset \cite{dawo_2025_14620362} with kidney cancer (KC) and lung cancer (LC) samples, each paired with H&E histology and pathologist annotations. (B--C) Spatial domain detection using multimodal H&E--ST embeddings of STORM outperforms the classical HVGs+PCA pipeline and foundation models (e.g., OmiCLIP), clearly separating tumor (TUM), normal (NOR), tertiary lymphoid structures (TLS), and infiltrate (INFL) regions. (D--F) Across eight independent samples, STORM achieves the highest concordance with ground truth (NMI/ARI) and superior spatial homogeneity, with robust performance across clustering resolutions. (G--K) In KC2, STORM identifies three tumor subdomains, with Domain 3 enriched for metabolic, inflammatory, and invasive programs (e.g., FABP7, RARRES2, IGFBP3, FGB/FGG, C3) and hallmark pathways linked to aggressive phenotypes, validated in TCGA-KIRC as predicting poorer prognosis. (L--M) In KC3, fine-resolution clustering resolves TLS and infiltrates into distinct subtypes: INFL2, enriched in plasma cells and humoral immunity, and INFL1, enriched in T follicular helper cells, NK cells, and dendritic cells, indicating distinct immune activation states and interaction patterns.
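The ARI concordance metric named in Figure 2 (D--F) can be illustrated with a minimal sketch. This is the standard adjusted Rand index computed from a contingency table, not the authors' evaluation code; the toy annotation labels and cluster assignments below are invented for illustration only.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """ARI between two labelings of the same spots (1.0 = identical partitions)."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:  # fewer than two spots: nothing to compare
        return 1.0

    # Count spot pairs that co-occur within each (annotation, cluster) cell,
    # within each annotation class, and within each predicted cluster.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:  # degenerate case, e.g. one cluster in both labelings
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Toy example (hypothetical data): pathologist labels vs. two clusterings.
annotation = ["TUM", "TUM", "TUM", "NOR", "NOR", "TLS"]
ari_perfect = adjusted_rand_index(annotation, [0, 0, 0, 1, 1, 2])
ari_split = adjusted_rand_index(annotation, [0, 0, 1, 1, 1, 2])
```

ARI depends only on how spots are grouped, not on the label names, so cluster IDs need no matching to annotation names before scoring.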
Figure 3: Predicting spatial transcriptomics from H&E images. (A) Benchmarking STORM against competing methods on bladder, colorectal, lung, and prostate datasets shows that STORM consistently achieves the highest Pearson correlation coefficient (PCC) in all datasets. Boxplots show gene-wise PCC between predicted and observed gene expression for 500 HVGs. The central line represents the median; boxes span the interquartile range (IQR, 25th--75th percentile). Whiskers extend to the 10th and 90th percentiles. (B) Models are trained to predict whole transcriptomes, and the PCC of the top 500 genes is reported across eight tissue types. Boxplots display the distribution of PCC values, illustrating that STORM achieves consistently higher accuracy than OmiCLIP across all tissues. (C) PCCs as a function of top gene panel size (100--2,000 genes) on the colorectal dataset. Points indicate the median PCC at each panel size. (D) Visualization of spatial gene expression predicted from H&E for clinically relevant biomarkers. Rows show, from top to bottom: H&E image, measured gene expression, STORM predictions, and Virchow predictions. Compared with Virchow, STORM produced substantially higher correlations with ground-truth gene expression.
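The gene-wise PCC summarized in the Figure 3 boxplots can be sketched as follows: one Pearson correlation per gene across spots, then a median over genes. This is a minimal illustration, not the authors' pipeline; the gene names and expression values are hypothetical, and the dictionary layout (gene name to per-spot values) is an assumption.

```python
from math import sqrt
from statistics import median


def pearson(x, y):
    """Pearson correlation of two equal-length sequences; NaN if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def genewise_pcc(observed, predicted):
    """Map each gene to the PCC between observed and predicted per-spot expression."""
    return {gene: pearson(observed[gene], predicted[gene]) for gene in observed}


# Hypothetical toy data: three genes measured on four spots.
observed = {"GENE_A": [1, 2, 3, 4], "GENE_B": [2, 1, 4, 3], "GENE_C": [0, 1, 0, 1]}
predicted = {"GENE_A": [2, 4, 6, 8], "GENE_B": [1, 2, 3, 4], "GENE_C": [1, 0, 1, 0]}

pcc = genewise_pcc(observed, predicted)
# Drop undefined correlations (constant genes) before summarizing.
median_pcc = median(v for v in pcc.values() if v == v)
```

Because the correlation is computed per gene, a prediction that is off by a constant scale (as for the hypothetical GENE_A) still scores 1.0; PCC measures spatial pattern agreement rather than absolute expression error.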
Figure 4: Prognosis prediction with multimodal integration by STORM. (A) Workflow for applying STORM to prognosis prediction based on routine H&E slides. First, STORM was used to generate virtual ST from routine H&E slides, yielding paired H&E images and virtual ST. Features from both modalities were encoded by STORM, integrated through a cross-attention module, and aggregated using an attention-based multiple instance learning (AbMIL) model. Final predictions were generated from the multimodal embedding via an MLP layer. We evaluated STORM using real-world large clinical cohorts of lung and colorectal cancers, with disease-specific survival (DSS) as the endpoint across training, validation, and independent test sets. (B) STORM significantly outperformed unimodal baseline models (H&E-only, virtual ST-only) and the H&E--ST model OmiCLIP for prognosis prediction. The improvements were highly significant ($p<0.0001$). Column plots show 1,000 bootstrap samples, with bars indicating mean values and error bars representing standard deviations. (C) Kaplan--Meier curves demonstrate significant patient stratification into high- and low-risk groups. A fixed cut-off value was applied to test cohorts. (D) Multivariate analysis adjusting for clinical risk factors, including age, grade, and stage. In both colorectal and lung cancer, STORM remained an independent prognostic factor ($p<0.001$). In the figure, $*$ denotes $p<0.05$; $**$ denotes $p<0.01$; and $***$ denotes $p<0.001$. (E) Comparison of STORM risk prediction with grade and stage. Integration of STORM with clinical variables further improved performance. Column plots show 1,000 bootstrap samples, with bars indicating mean values and error bars representing standard deviations.
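The AbMIL aggregation step in Figure 4 (A) can be illustrated with a minimal sketch of generic (non-gated) attention pooling, where each instance embedding h_k receives a weight a_k = softmax_k(w . tanh(V h_k)) and the slide embedding is the weighted sum. Whether STORM uses this exact variant is an assumption; the parameters `V` and `w` are hypothetical stand-ins for learned weights, and the cross-attention fusion and MLP head are omitted.

```python
from math import exp, tanh


def abmil_pool(instances, V, w):
    """Attention pooling over a bag of instance embeddings.

    instances: list of embeddings (each a list of floats of length D)
    V: hidden projection, list of H rows with D entries each (hypothetical weights)
    w: attention vector of length H (hypothetical weights)
    Returns (pooled embedding of length D, attention weights summing to 1).
    """
    scores = []
    for h in instances:
        hidden = [tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wi * ui for wi, ui in zip(w, hidden)))

    top = max(scores)  # subtract the max for a numerically stable softmax
    exps = [exp(s - top) for s in scores]
    norm = sum(exps)
    attention = [e / norm for e in exps]

    dim = len(instances[0])
    pooled = [sum(a * h[d] for a, h in zip(attention, instances)) for d in range(dim)]
    return pooled, attention


# Toy bag of three 2-dimensional instance embeddings (illustrative values only).
bag = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
w = [1.0, 1.0]
pooled, attention = abmil_pool(bag, V, w)
```

The attention weights also serve as a per-instance importance score, which is one common way to pick out high-attention regions for inspection.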
Figure 5: Biological interpretation of prognosis prediction at the tissue, cell, and gene levels. (A) Tissue-level interpretation. Representative high- and low-risk colorectal and lung cancer ROIs from the validation set. High-risk colorectal cancer ROIs show classic adenocarcinoma features, while high-risk lung cancer ROIs are enriched with extracellular mucin characteristic of mucinous adenocarcinoma. Low-risk colorectal cancer ROIs display a strong TIL response, and low-risk lung cancer ROIs exhibit chronic inflammation, hemorrhage, and fibrotic stroma, all markers of a favorable prognosis. (B) Cell-level interpretation. Boxplots show the distribution of cell-type fractions across patients for cancer cells, lymphocytes, fibroblasts, plasma cells, macrophages, and endothelial cells. High-risk predictions were associated with enrichment of cancer cells, fibroblasts, and endothelial cells, whereas low-risk predictions were associated with increased lymphocytes and plasma cells. Macrophages were modestly enriched in low-risk patients. $p$-values from Welch's t-test are indicated. (C) Gene-level interpretation. Risk scores from STORM identified prognostic genes (CD74, HLA-A/C, B2M, HSP90AA1/AB1, EPAS1, EZR, XPO1) and enriched pathways (e.g., Interferon-Gamma Response, Unfolded Protein Response), highlighting proliferative, stress-response, and immune programs with therapeutic relevance.
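The Welch's t-test used for the cell-type comparisons in Figure 5 (B) can be sketched as below. The function computes the test statistic and the Welch--Satterthwaite degrees of freedom; the per-patient lymphocyte fractions are made-up toy numbers, not values from the paper.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's unequal-variance t statistic and Welch-Satterthwaite degrees of freedom."""
    se_a = variance(a) / len(a)  # squared standard error of each group mean
    se_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (len(a) - 1) + se_b ** 2 / (len(b) - 1))
    return t, df


# Hypothetical lymphocyte fractions per patient in each predicted risk group.
high_risk = [0.10, 0.12, 0.08, 0.11]
low_risk = [0.20, 0.25, 0.22, 0.27, 0.21]
t_stat, dof = welch_t(high_risk, low_risk)

# A two-sided p-value requires the t-distribution CDF, which the standard
# library lacks; scipy.stats.ttest_ind(high_risk, low_risk, equal_var=False)
# computes the same test including the p-value.
```

Unlike Student's t-test, the Welch variant does not assume equal variances in the two groups, which suits risk groups of different sizes and spread.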
...and 5 more figures