Multimodal Prototyping for cancer survival prediction

Andrew H. Song; Richard J. Chen; Guillaume Jaume; Anurag J. Vaidya; Alexander S. Baras; Faisal Mahmood

TL;DR

This paper addresses cancer survival prediction by fusing gigapixel WSIs with bulk transcriptomics. We propose MultiModal Prototyping (MMP), which compresses histology into Gaussian Mixture Model prototypes ($C_h \le 32$) and encodes transcriptomics into biological pathway prototypes ($C_g=50$ Hallmark pathways), yielding a compact multimodal token set built without supervision. Fusion uses either Transformer attention or entropic-regularized Optimal Transport cross-alignment, at a reduced cost of roughly $\mathcal{O}((C_g+C_h)^2)$. MMP improves performance across six TCGA cohorts and enables bidirectional interpretability between morphologies and pathways. The result is a scalable, interpretable framework for prognosis and risk stratification, with potential for clinical translation and for extensions to data-driven prototypes and single-cell-inspired genomics models.

Abstract

Multimodal survival methods combining gigapixel histology whole-slide images (WSIs) and transcriptomic profiles are particularly promising for patient prognostication and stratification. Current approaches tokenize the WSIs into smaller patches (>10,000 patches) and the transcriptomics into gene groups, which are then integrated using a Transformer to predict outcomes. However, this process generates many tokens, which leads to high memory requirements for computing attention and complicates post-hoc interpretability analyses. Instead, we hypothesize that we can: (1) effectively summarize the morphological content of a WSI by condensing its constituent tokens using morphological prototypes, achieving more than 300x compression; and (2) accurately characterize cellular functions by encoding the transcriptomic profile with biological pathway prototypes, all in an unsupervised fashion. The resulting multimodal tokens are then processed by a fusion network, either with a Transformer or an optimal transport cross-alignment, which now operates with a small and fixed number of tokens without approximations. Extensive evaluation on six cancer types shows that our framework outperforms state-of-the-art methods with much less computation while unlocking new interpretability analyses.
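To make the compression step concrete, the following is a minimal sketch of how a set of patch embeddings could be condensed into a handful of Gaussian mixture prototypes. It is an illustration under stated assumptions, not the authors' implementation: the function name `gmm_prototype_summary`, the diagonal covariances, the random initialization, and the choice to concatenate weight, mean, and variance into one token per prototype are all hypothetical. With more than 10,000 patches and at most 32 prototypes, the token count shrinks by a factor of at least 10,000 / 32 ≈ 312, consistent with the "more than 300x" figure above.

```python
import numpy as np


def gmm_prototype_summary(patches, num_prototypes=16, num_iters=10, seed=0):
    """Fit a diagonal-covariance GMM to (N, d) patch embeddings with plain EM.

    Returns one token per prototype of shape (C, 2 * d + 1), holding the
    mixture weight, the mean, and the variance (an assumed token layout).
    """
    rng = np.random.default_rng(seed)
    n, d = patches.shape

    # Assumed initialization: means drawn from the patches, unit variances,
    # uniform mixture weights.
    means = patches[rng.choice(n, size=num_prototypes, replace=False)].copy()
    variances = np.ones((num_prototypes, d))
    weights = np.full(num_prototypes, 1.0 / num_prototypes)

    for _ in range(num_iters):
        # E-step: posterior probability of each prototype for each patch.
        diff = patches[:, None, :] - means[None, :, :]  # (N, C, d)
        log_prob = -0.5 * np.sum(
            diff**2 / variances + np.log(2.0 * np.pi * variances), axis=2
        )
        log_post = np.log(weights) + log_prob  # (N, C)
        log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)

        # M-step: re-estimate weights, means, and variances.
        mass = post.sum(axis=0) + 1e-8  # (C,)
        weights = mass / mass.sum()
        means = (post.T @ patches) / mass[:, None]
        second_moment = (post.T @ patches**2) / mass[:, None]
        variances = np.maximum(second_moment - means**2, 1e-6)

    # One fixed-size token per prototype, independent of the number of patches.
    return np.concatenate([weights[:, None], means, variances], axis=1)
```

The output size depends only on the number of prototypes and the embedding dimension, which is what lets a downstream fusion network operate on a small, fixed number of tokens regardless of slide size.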

Paper Structure (46 sections, 2 theorems, 22 equations, 3 figures, 9 tables)


Introduction
Related Work
Representing sets with prototypes
Prognostication with multimodal fusion
Methods
Prototype-based encoding
Morphological prototypes (Histology)
Pathway prototypes (Genomics)
Multimodal fusion
Token dimension matching
Multimodal fusion
Connection between transformer and Optimal transport cross-alignment
Survival prediction
Enhancing prototypes
Experiments
...and 31 more sections

Key Result

Lemma 3.1

Let $\mathbf{Z}_\text{g.}\in\mathbb{R}^{C_\text{g.}\times d}$ and $\mathbf{Z}_\text{h.}\in\mathbb{R}^{C_\text{h.}\times d}$ be the matrix representations of the token sets $\{\mathbf{z}_{i,\text{g.}}\}_{i=1}^{C_\text{g.}}$ and $\{\mathbf{z}_{k,\text{h.}}\}_{k=1}^{C_\text{h.}}$. Let …

Figures (3)

Figure 1: Overview of $\textsc{MMP}$. (A) The tessellated WSI patches (tokens) are projected to low-dimensional embeddings with a pretrained patch encoder. The patch embeddings ($N_\text{h.}>10^4$) are aggregated into a slide summary using a small set of prototypes ($C_\text{h.}<32$). (B) The transcriptomics data is projected onto a set of binary vectors indicating the presence of specific genes in each pathway, forming a pathway summary. (C) The post-aggregation embeddings from both modalities are first matched to the same dimension. Cross-modal interactions between histology and transcriptomics are learned with a Transformer or with Optimal Transport, while intra-modal interactions are learned with Transformer-based self-attention. The attended embeddings are aggregated to form a patient-level embedding used for risk prediction.
Figure 2: Cross-modal interaction visualization. (A) A WSI for a BRCA patient. (B) The morphological prototype heatmap for $c=13$ (C13), representing invasive ductal carcinoma (IDC), based on the posterior distribution for C13. (C) Prototype assignment map showing the closest morphological prototype for each patch in the WSI. (D) Top-3 patches for each morphological prototype and proportion of each prototype in the WSI. (E) The top-10 pathways attending to C13. (F) Top-6 morphological prototypes attending to the pathways in (E).
Figure 3: Cross-modal interaction visualization. (A) BRCA WSIs with their prototype assignment maps (categorical assignment of each histology patch to its nearest prototype), and prototype heatmaps of the top-3 prominent tissue patterns in the WSI. (B) Morphological annotations provided by a board-certified pathologist of the nearest histology patches for each prototype. (C) For each prototype visualized in (A), we can visualize its most highly-attended pathways (h. $\rightarrow$ g.), i.e., which pathways correspond to the queried prototype (pathway importance).
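The cross-modal maps in these figures come from relating a small set of pathway tokens to a small set of histology tokens. As a rough illustration of the entropic-regularized Optimal Transport option, the sketch below computes a transport plan with Sinkhorn iterations. Everything specific here is an assumption rather than the paper's formulation: the function names, the squared Euclidean cost, the uniform marginals, the regularization strength `eps`, and the barycentric projection used to pull histology content onto each pathway.

```python
import numpy as np


def sinkhorn_plan(z_g, z_h, eps=0.5, num_iters=200):
    """Entropic OT plan between pathway tokens (C_g, d) and histology tokens (C_h, d).

    Returns a (C_g, C_h) matrix whose rows sum to 1 / C_g and whose columns
    sum to 1 / C_h (uniform marginals are an assumption of this sketch).
    """
    c_g, c_h = z_g.shape[0], z_h.shape[0]

    # Assumed cost: squared Euclidean distance, rescaled to [0, 1].
    cost = np.sum((z_g[:, None, :] - z_h[None, :, :]) ** 2, axis=2)
    cost = cost / max(cost.max(), 1e-12)
    kernel = np.exp(-cost / eps)

    a = np.full(c_g, 1.0 / c_g)
    b = np.full(c_h, 1.0 / c_h)
    u = np.ones(c_g)
    v = np.ones(c_h)

    # Sinkhorn iterations: alternately rescale rows and columns of the kernel.
    for _ in range(num_iters):
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)

    return u[:, None] * kernel * v[None, :]


def cross_align(z_g, z_h, eps=0.5):
    """Barycentric projection: one histology-informed vector per pathway token."""
    plan = sinkhorn_plan(z_g, z_h, eps=eps)
    row_normalized = plan / plan.sum(axis=1, keepdims=True)
    return row_normalized @ z_h  # (C_g, d)
```

Because the plan has only $C_\text{g.} \times C_\text{h.}$ entries (for example 50 by 16), it can be computed exactly and inspected row by row or column by column, which is one way to read off pathway-to-prototype and prototype-to-pathway associations.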

Theorems & Definitions (4)

Lemma 3.1
proof
Lemma 2.1
proof
