Multimodal Prototyping for cancer survival prediction

Andrew H. Song, Richard J. Chen, Guillaume Jaume, Anurag J. Vaidya, Alexander S. Baras, Faisal Mahmood

TL;DR

This work addresses cancer survival prediction by fusing gigapixel whole-slide images (WSIs) and bulk transcriptomics. We propose MultiModal Prototyping (MMP), which compresses histology into Gaussian Mixture Model prototypes ($C_h \le 32$) and encodes transcriptomics into biological pathway prototypes ($C_g = 50$ Hallmark pathways), yielding a compact multimodal token set built without supervision. Fusion uses either Transformer attention or entropic-regularized Optimal Transport cross-alignment, at a reduced cost of roughly $\mathcal{O}((C_g+C_h)^2)$. MMP improves performance across six TCGA cohorts while enabling bidirectional interpretability between morphologies and pathways. The result is a scalable, interpretable framework for prognosis and risk stratification, with potential for clinical translation and for extensions to data-driven prototypes and single-cell-inspired genomics models.
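
The entropic-regularized Optimal Transport fusion mentioned above can be illustrated with a plain Sinkhorn iteration. The sketch below is not the authors' implementation: the squared Euclidean cost, the uniform marginals, the regularization strength `eps`, and the toy token values are all assumptions chosen for illustration.

```python
import math


def sinkhorn_plan(cost, eps=0.5, n_iters=200):
    """Entropic-regularized OT plan between uniform marginals (plain Sinkhorn)."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # assumed uniform mass on pathway tokens
    b = [1.0 / m] * m  # assumed uniform mass on histology prototype tokens
    # Gibbs kernel K = exp(-cost / eps)
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternate scalings: rows are matched to a, then columns to b
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


def sq_dist(x, y):
    """Squared Euclidean distance (an assumed cost, not taken from the paper)."""
    return sum((p - q) ** 2 for p, q in zip(x, y))


# Toy tokens: 3 "pathway" and 2 "histology" embeddings in 2-D (made-up numbers)
z_g = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
z_h = [[0.1, 0.0], [0.9, 0.9]]

cost = [[sq_dist(g, h) for h in z_h] for g in z_g]
plan = sinkhorn_plan(cost)

for row in plan:
    print([round(p, 4) for p in row])
```

Each entry of `plan` gives the mass moved between one pathway token and one histology token, so the matrix has only $C_g \times C_h$ entries regardless of how many patches the slide originally contained.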

Abstract

Multimodal survival methods combining gigapixel histology whole-slide images (WSIs) and transcriptomic profiles are particularly promising for patient prognostication and stratification. Current approaches involve tokenizing the WSIs into smaller patches (>10,000 patches) and transcriptomics into gene groups, which are then integrated using a Transformer for predicting outcomes. However, this process generates many tokens, which leads to high memory requirements for computing attention and complicates post-hoc interpretability analyses. Instead, we hypothesize that we can: (1) effectively summarize the morphological content of a WSI by condensing its constituting tokens using morphological prototypes, achieving more than 300x compression; and (2) accurately characterize cellular functions by encoding the transcriptomic profile with biological pathway prototypes, all in an unsupervised fashion. The resulting multimodal tokens are then processed by a fusion network, either with a Transformer or an optimal transport cross-alignment, which now operates with a small and fixed number of tokens without approximations. Extensive evaluation on six cancer types shows that our framework outperforms state-of-the-art methods with much less computation while unlocking new interpretability analyses.
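
The compression and cost figures quoted above can be checked with simple arithmetic. The snippet below is only a back-of-the-envelope sketch: it takes the lower bound of 10,000 patches and the upper bound of 32 histology prototypes, and it assumes full pairwise attention over all tokens as the "before" case.

```python
# Back-of-the-envelope token counts; values are the bounds quoted in the text.
n_patches = 10_000  # WSI patch tokens before prototyping (abstract: >10,000)
c_h = 32            # histology prototypes (TL;DR: C_h <= 32)
c_g = 50            # pathway prototypes (TL;DR: 50 Hallmark pathways)

compression = n_patches / c_h

# Number of pairwise attention scores, assuming full attention over all tokens
pairs_before = (n_patches + c_g) ** 2
pairs_after = (c_h + c_g) ** 2

print(f"histology compression: {compression:.1f}x")
print(f"attention entries before prototyping: {pairs_before:,}")
print(f"attention entries after prototyping:  {pairs_after:,}")
```

Even at these conservative bounds the histology token count shrinks by a factor of 312.5, consistent with the "more than 300x" claim, and the pairwise score matrix drops from about $10^8$ entries to 6,724.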

Paper Structure (46 sections, 2 theorems, 22 equations, 3 figures, 9 tables)

Key Result

Lemma 3.1

Let $\mathbf{Z}_\text{g.}\in\mathbb{R}^{C_\text{g.}\times d}$ and $\mathbf{Z}_\text{h.}\in\mathbb{R}^{C_\text{h.}\times d}$ be the matrix representation of the token sets $\{\mathbf{z}_{i,\text{g.}} \}_{i=1}^{C_\text{g.}}$ and $\{\mathbf{z}_{k,\text{h.}} \}_{k=1}^{C_\text{h.}}$. Let $\mathbf{Z}_{\te

Figures (3)

  • Figure 1: Overview of $\textsc{MMP}$. (A) The tessellated WSI patches (tokens) are projected to low-dimensional embeddings with a pretrained patch encoder. The patch embeddings ($N_\text{h.}>10^4$) are aggregated into a slide summary using a small set of prototypes ($C_\text{h.}<32$). (B) The transcriptomics data is projected onto a set of binary vectors indicating the presence of specific genes in each pathway, forming a pathway summary. (C) The post-aggregation embeddings from both modalities are first matched to the same dimension. Cross-modal interactions between histology and transcriptomics are learned with a Transformer or with Optimal Transport, and intra-modal interactions are learned with Transformer-based self-attention. The attended embeddings are aggregated to form a patient-level embedding used for risk prediction.
  • Figure 2: Cross-modal interaction visualization. (A) A WSI for a BRCA patient. (B) The morphological prototype heatmap for $c=13$ (C13), representing invasive ductal carcinoma (IDC), based on the posterior distribution for C13. (C) Prototype assignment map showing the closest morphological prototype for each patch in the WSI. (D) Top-3 patches for each morphological prototype and proportion of each prototype in the WSI. (E) The top-10 pathways attending to C13. (F) Top-6 morphological prototypes attending to the pathways in (E).
  • Figure 3: Cross-modal interaction visualization. (A) BRCA WSIs with their prototype assignment maps (categorical assignment of each histology patch to its nearest prototype) and prototype heatmaps of the three most prominent tissue patterns in each WSI. (B) Morphological annotations, provided by a board-certified pathologist, of the nearest histology patches for each prototype. (C) For each prototype visualized in (A), its most highly attended pathways (h. $\rightarrow$ g.), i.e., which pathways correspond to the queried prototype (pathway importance).
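
The prototype heatmaps, assignment maps, and prototype proportions described in these captions all derive from Gaussian mixture posteriors over patch embeddings. The sketch below shows one way such quantities could be computed for a mixture with fixed parameters. It is an assumption-laden illustration, not the paper's code: the diagonal covariances, the toy 2-D embeddings, the use of mean posterior as "proportion", and the posterior-weighted mean as a summary token are all hypothetical choices.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def posteriors(x, weights, means, variances):
    """Posterior p(c | x) over mixture components for one patch embedding."""
    logs = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)  # subtract the max before exponentiating for stability
    exps = [math.exp(l - top) for l in logs]
    total = sum(exps)
    return [e / total for e in exps]


# Toy mixture: 2 prototypes in a 2-D embedding space (hypothetical values)
weights = [0.5, 0.5]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
patches = [[0.1, -0.2], [2.9, 3.1], [0.3, 0.2], [3.2, 2.7]]

# Per-patch posteriors: the values a prototype heatmap would display
resp = [posteriors(x, weights, means, variances) for x in patches]

# Hard assignment of each patch to its most likely prototype (assignment map)
assignment = [max(range(len(r)), key=r.__getitem__) for r in resp]

# Share of each prototype in the slide, taken here as the mean posterior
proportion = [sum(r[c] for r in resp) / len(resp) for c in range(len(weights))]

# Posterior-weighted mean embedding per prototype (one summary token each)
dim = len(patches[0])
summary = []
for c in range(len(weights)):
    mass = sum(r[c] for r in resp)
    summary.append(
        [sum(r[c] * x[d] for r, x in zip(resp, patches)) / mass for d in range(dim)]
    )

print(assignment)
print([round(p, 3) for p in proportion])
```

Whatever the number of patches, the output has one summary token per prototype, which is what keeps the downstream fusion step small and fixed in size.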

Theorems & Definitions (4)

  • Lemma 3.1
  • proof
  • Lemma 2.1
  • proof