Prototype-driven fusion of pathology and spatial transcriptomics for interpretable survival prediction

Lihe Liu, Xiaoxi Pan, Yinyin Yuan, Lulu Shang

TL;DR

PathoSpatial is a proof-of-concept for scalable, interpretable multimodal learning that fuses spatial omics with pathology. Its design supports post-hoc prototype interpretation and molecular risk decomposition, yielding quantitative, biologically grounded explanations.

Abstract

Whole slide images (WSIs) enable weakly supervised prognostic modeling via multiple instance learning (MIL). Spatial transcriptomics (ST) preserves in situ gene expression, providing a spatial molecular context that complements morphology. As paired WSI-ST cohorts scale to population level, leveraging their complementary spatial signals for prognosis becomes crucial; however, principled cross-modal fusion strategies remain limited for this paradigm. To this end, we introduce PathoSpatial, an interpretable end-to-end framework that integrates co-registered WSIs and ST to learn spatially informed prognostic representations. PathoSpatial uses task-guided prototype learning within a multi-level experts architecture, adaptively orchestrating unsupervised within-modality discovery with supervised cross-modal aggregation. By design, PathoSpatial substantially strengthens interpretability while maintaining discriminative ability. We evaluate PathoSpatial on a triple-negative breast cancer cohort with paired ST and WSIs. PathoSpatial delivers strong and consistent performance across five survival endpoints, matching or exceeding leading unimodal and multimodal methods. PathoSpatial inherently enables post-hoc prototype interpretation and molecular risk decomposition, providing quantitative, biologically grounded explanations and highlighting candidate prognostic factors. We present PathoSpatial as a proof-of-concept for scalable and interpretable multimodal learning for spatial omics-pathology fusion.
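
To make the idea of prototype-based cross-modal aggregation and risk decomposition concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes both modalities have already been projected to a shared feature dimension, uses soft cosine-similarity assignment to fixed prototypes, and applies a linear risk head so that the patient risk splits exactly into per-prototype contributions. All function names, shapes, and parameters (`prototype_pool`, `fuse_and_score`, `attn_w`, `risk_w`, `temperature`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prototype_pool(features, prototypes, temperature=1.0):
    """Soft-assign instances (n, d) to prototypes (k, d); return one embedding per prototype (k, d)."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    assign = softmax(f @ p.T / temperature, axis=1)        # (n, k), each row sums to 1
    weights = assign / (assign.sum(axis=0, keepdims=True) + 1e-8)  # normalize per prototype
    return weights.T @ features                            # weighted mean of instances

def fuse_and_score(wsi_feats, st_feats, wsi_protos, st_protos, attn_w, risk_w):
    """Attention-pool prototype tokens from both modalities and score risk (assumed linear head)."""
    tokens = np.concatenate(
        [prototype_pool(wsi_feats, wsi_protos), prototype_pool(st_feats, st_protos)], axis=0
    )                                                      # (k_wsi + k_st, d)
    attn = softmax(tokens @ attn_w)                        # one attention weight per prototype
    contributions = attn * (tokens @ risk_w)               # per-prototype share of the risk
    risk = float(contributions.sum())                      # equals (attn @ tokens) @ risk_w
    return risk, attn, contributions

# Toy data: 200 WSI patches and 150 ST spots in a shared 16-dim space (all values random).
rng = np.random.default_rng(0)
d = 16
risk, attn, contributions = fuse_and_score(
    rng.normal(size=(200, d)), rng.normal(size=(150, d)),
    rng.normal(size=(4, d)), rng.normal(size=(3, d)),
    rng.normal(size=d), rng.normal(size=d),
)
```

With a linear head, the risk score is a sum of per-prototype terms, which is one simple way a decomposition of risk over histology and ST prototypes could be read off. The paper's actual architecture, with its multi-level experts and task-guided prototype updates, may differ substantially.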

Paper Structure (52 sections, 29 equations, 5 figures, 7 tables)


Figures (5)

  • Figure 1: Task-guided Prototype Learning. (a) Prototype-based methods provide compact and effective representations of WSIs in MIL. Prior strategies have generally taken two forms, focusing on interpretability (fully unsupervised) or task-alignment (fully task-conditioned). We explore a hybrid task-guided prototype learning strategy balancing interpretability and task-guidance. (b) We apply task-guided prototype learning in the fusion of ST and H&E.
  • Figure 2: Model Overview. PathoSpatial integrates co-registered whole-slide images (WSIs) and spatial transcriptomics (ST) via modality-specific encoders and adaptive prototype learning. Each modality learns an evolving prototype bank that captures morphology- or gene-level patterns. A fusion module performs cross-modal multiple-instance aggregation to produce a unified patient-level representation. The prognostic prediction task provides task-guided feedback that refines the prototypes, enhancing their discriminative power and biological relevance. Prototype activations produce spatially interpretable maps linking histological regions and molecular signatures to survival risk.
  • Figure 3: Prototype Interpretation and Risk Decomposition. (a) The whole slide image for this sample. (b) Histology prototype assignments and their spatial locations; a few prototypes were deliberately merged based on spatial adjacency. Overall, they show high concordance with the pathology annotation. (c) ST prototype assignments. (d) UMAP of all ST spot features, grouped by prototype, with functional characterization at the merged-prototype level. (e) Per-prototype attention scores used by the model, with interpretable semantics.
  • Figure S1: Extended Model Interpretability. (a) Visualization of histology prototype assignments alongside ground-truth pathology annotations, demonstrating spatial correspondence between learned concepts and tissue structures. (b) Quantitative analysis of tissue composition for each histology prototype, calculated as the proportion of pathology annotation categories within the top-100 attending patches. This confirms that prototypes specialize in distinct morphological patterns. (c) Hierarchical clustering of prototypes from both modalities. The distinct separation between Histology and ST clusters illustrates the non-redundant, complementary nature of the learned multimodal representations.
  • Figure S2: Modality-Oriented Patient Stratification.
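
The composition analysis described for Figure S1(b), the proportion of pathology annotation categories among each prototype's top-100 attending patches, can be sketched as follows. This is an illustrative reconstruction under assumptions, not the authors' code: the function name `prototype_composition`, the attention matrix layout (patches by prototypes), and the example category labels are hypothetical.

```python
from collections import Counter

import numpy as np

def prototype_composition(attention, labels, top_k=100):
    """For each prototype, return {annotation category: proportion} over its top_k attending patches.

    attention: array of shape (n_patches, n_prototypes), higher means stronger attention.
    labels:    sequence of n_patches annotation categories (strings).
    """
    attention = np.asarray(attention)
    labels = np.asarray(labels)
    k = min(top_k, attention.shape[0])  # guard against slides with fewer than top_k patches
    compositions = []
    for j in range(attention.shape[1]):
        top_idx = np.argsort(attention[:, j])[::-1][:k]   # indices of the k highest scores
        counts = Counter(labels[top_idx].tolist())
        compositions.append({cat: n / k for cat, n in counts.items()})
    return compositions

# Toy example with hypothetical categories and top_k=2 instead of 100.
attention = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]]
labels = ["tumor", "tumor", "stroma", "stroma"]
result = prototype_composition(attention, labels, top_k=2)
# result == [{"tumor": 1.0}, {"stroma": 1.0}]
```

A prototype whose top patches fall mostly in one category, as in the toy example, would count as specialized in the sense the caption describes.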