Multimodal Prototyping for cancer survival prediction
Andrew H. Song, Richard J. Chen, Guillaume Jaume, Anurag J. Vaidya, Alexander S. Baras, Faisal Mahmood
TL;DR
This paper addresses cancer survival prediction by fusing gigapixel whole-slide images (WSIs) with bulk transcriptomics. It proposes MultiModal Prototyping (MMP), which compresses each WSI into Gaussian Mixture Model morphological prototypes ($C_h \le 32$) and encodes the transcriptomic profile as biological pathway prototypes ($C_g = 50$ Hallmark pathways). Both steps are unsupervised and produce a compact multimodal token set. Fusion uses either Transformer attention or entropic-regularized Optimal Transport cross-alignment, at a cost of roughly $\mathcal{O}((C_g+C_h)^2)$, which no longer depends on the number of patches. MMP improves performance across six TCGA cohorts and supports bidirectional interpretability between morphologies and pathways. The result is a scalable, interpretable framework for prognosis and risk stratification, with potential for clinical translation and for extensions to data-driven prototypes and single-cell-inspired genomics models.
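To make the optimal transport option concrete, the following is a minimal pure-Python sketch of entropic-regularized OT (Sinkhorn iterations) between $C_h$ histology prototype tokens and $C_g$ pathway tokens. It is an illustration under stated assumptions, not the authors' implementation: the random embeddings, the embedding size `dim`, the uniform marginals, the scaled squared-Euclidean cost, the value of `eps`, and the final barycentric projection are all hypothetical choices made for this example.

```python
import math
import random


def sq_dist_cost(xs, ys):
    """Pairwise squared Euclidean cost, scaled to [0, 1] (an assumed cost)."""
    cost = [[sum((p - q) ** 2 for p, q in zip(x, y)) for y in ys] for x in xs]
    c_max = max(max(row) for row in cost)
    return [[c / c_max for c in row] for row in cost]


def sinkhorn_plan(cost, a, b, eps=0.5, n_iters=200):
    """Entropic OT plan with row marginals a and column marginals b."""
    n, m = len(a), len(b)
    # Gibbs kernel K_ij = exp(-cost_ij / eps)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows to match a, then columns to match b.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


rng = random.Random(0)
C_h, C_g, dim = 32, 50, 16  # dim is a placeholder embedding size

# Stand-ins for prototype tokens; real tokens would come from the encoders.
hist = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(C_h)]
path = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(C_g)]

a = [1.0 / C_h] * C_h  # uniform mass on histology prototypes (assumption)
b = [1.0 / C_g] * C_g  # uniform mass on pathway tokens (assumption)
plan = sinkhorn_plan(sq_dist_cost(hist, path), a, b)

# One common way to use a plan: barycentric projection, which expresses each
# pathway token as a plan-weighted average of histology prototypes.
aligned = [
    [sum(plan[i][j] * hist[i][d] for i in range(C_h)) / b[j] for d in range(dim)]
    for j in range(C_g)
]
```

The plan is a $C_h \times C_g$ matrix, so its size is fixed by the number of prototypes rather than by the number of patches in the slide. Each entry `plan[i][j]` can be read as how strongly histology prototype `i` is matched to pathway `j`.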
Abstract
Multimodal survival methods combining gigapixel histology whole-slide images (WSIs) and transcriptomic profiles are particularly promising for patient prognostication and stratification. Current approaches involve tokenizing the WSIs into smaller patches (>10,000 per WSI) and transcriptomics into gene groups, which are then integrated using a Transformer for predicting outcomes. However, this process generates many tokens, which leads to high memory requirements for computing attention and complicates post-hoc interpretability analyses. Instead, we hypothesize that we can: (1) effectively summarize the morphological content of a WSI by condensing its constituent tokens using morphological prototypes, achieving more than 300x compression; and (2) accurately characterize cellular functions by encoding the transcriptomic profile with biological pathway prototypes, all in an unsupervised fashion. The resulting multimodal tokens are then processed by a fusion network, either with a Transformer or an optimal transport cross-alignment, which now operates with a small and fixed number of tokens without approximations. Extensive evaluation on six cancer types shows that our framework outperforms state-of-the-art methods with much less computation while unlocking new interpretability analyses.
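As a back-of-the-envelope check on the compression claim, the short sketch below compares token counts and pairwise attention entries before and after prototyping. The figures are assumptions taken from the text: 10,000 patches is the stated lower bound, $C_h = 32$ is the upper bound from the TL;DR, and $C_g = 50$ is the number of Hallmark pathways. The patch-level count assumes full pairwise attention over all tokens, which existing methods may avoid through approximations.

```python
n_patches = 10_000  # lower bound from "> 10,000 patches"
C_h = 32            # upper bound on histology prototypes (from the TL;DR)
C_g = 50            # Hallmark pathway tokens (from the TL;DR)

# Histology tokens per slide shrink from n_patches to C_h.
compression = n_patches / C_h

# Entries in a full pairwise attention matrix over all multimodal tokens.
pairs_patch = (n_patches + C_g) ** 2
pairs_proto = (C_h + C_g) ** 2

print(f"token compression: {compression:.1f}x")
print(f"attention entries, patch-level: {pairs_patch:,}")
print(f"attention entries, prototypes:  {pairs_proto:,}")
print(f"reduction in attention entries: {pairs_patch / pairs_proto:,.0f}x")
```

Under these assumptions the histology tokens shrink by a factor of 312.5, consistent with the "more than 300x" figure, and the attention matrix drops from about 101 million entries to 6,724.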
