FORESEE: Multimodal and Multi-view Representation Learning for Robust Prediction of Cancer Survival
Liangrui Pan, Yijun Peng, Yan Li, Yiyi Liang, Liwen Xu, Qingchun Liang, Shaoliang Peng
TL;DR
FORESEE targets robust multimodal cancer survival prediction when data are incomplete. It introduces a Cross Fusion Transformer to fuse cross-scale whole-slide image (WSI) features, a Hybrid Attention Encoder (HAE) to denoise and summarize molecular data, and an asymmetric Triplet Masked Autoencoder (TriMAE) to reconstruct missing intra-modal information. Trained end-to-end with modality-specific Cox losses and a TriMAE reconstruction objective, FORESEE achieves state-of-the-art performance on four TCGA cancers, outperforming several baselines and yielding Kaplan-Meier (KM) risk groups that separate with significant p-values. Ablation studies indicate that multi-view pathology fusion, TriMAE for missing data, and HAE each contribute to the observed gains.
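To make the "Cox loss" mentioned above concrete, the following is a minimal sketch of a negative Cox partial log-likelihood in NumPy. It illustrates the general technique only, not the authors' implementation: the function name, the averaging over observed events, and the simple handling of tied event times are assumptions made for this example.

```python
import numpy as np

def cox_neg_partial_log_likelihood(risk, time, event):
    """Negative Cox partial log-likelihood, averaged over observed events.

    risk  : predicted log-risk score per patient (higher = worse prognosis)
    time  : observed follow-up time per patient
    event : 1 if the event (death) was observed, 0 if censored
    Tied times are not treated specially (illustrative simplification).
    """
    risk = np.asarray(risk, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=float)

    # Sort by descending time so that the risk set of the i-th sorted
    # patient is exactly the sorted patients 0..i.
    order = np.argsort(-time, kind="stable")
    risk, event = risk[order], event[order]

    # Running log-sum-exp of the risk scores over each risk set.
    log_risk_set = np.logaddexp.accumulate(risk)

    n_events = event.sum()
    if n_events == 0:
        return 0.0  # no observed events: the partial likelihood is undefined
    return float(-np.sum((risk - log_risk_set) * event) / n_events)
```

In a modality-specific setup such as the one described, one such term would presumably be computed per modality branch from that branch's risk scores and then combined with the reconstruction objective; the weighting of the terms is not stated here.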
Abstract
Integrating the different data modalities of cancer patients can significantly improve survival prediction. However, most existing methods do not jointly exploit the rich semantic features available at different scales in pathology images. In addition, collecting multimodal data and extracting features from it often leaves some information missing within a modality, which introduces noise. To address these challenges, this paper proposes FORESEE, a new end-to-end framework that robustly predicts patient survival by mining multimodal information. First, the cross-fusion transformer uses a cross-scale feature cross-fusion method to relate features at the cellular level, tissue level, and tumor heterogeneity level to prognosis, which strengthens the representation of pathological image features. Second, the hybrid attention encoder (HAE) uses a denoising contextual attention module to capture contextual relationships and local details in the molecular data, and a channel attention module to capture its global features. Third, to address missing information within modalities, we propose an asymmetrically masked triplet masked autoencoder that reconstructs the lost information. Extensive experiments on four benchmark datasets show that our method outperforms state-of-the-art methods in both complete and missing-data settings.
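The masked-reconstruction idea can be sketched as follows. The abstract does not specify the masking scheme, so this example assumes, purely for illustration, that "asymmetric" means a different mask ratio for each of three inputs. The function names, token shapes, and ratios are hypothetical, and the encoder and decoder are omitted.

```python
import numpy as np

def random_mask(tokens, mask_ratio, rng):
    """Zero out a random subset of token rows.

    Returns the masked copy and a boolean vector marking the masked rows.
    """
    n = tokens.shape[0]
    n_mask = int(round(mask_ratio * n))
    mask = np.zeros(n, dtype=bool)
    mask[rng.permutation(n)[:n_mask]] = True
    masked = tokens.copy()
    masked[mask] = 0.0
    return masked, mask

def masked_mse(reconstruction, target, mask):
    """Mean squared error computed over the masked rows only."""
    if not mask.any():
        return 0.0
    return float(np.mean((reconstruction[mask] - target[mask]) ** 2))

# Hypothetical usage: three inputs, each masked with its own ratio.
rng = np.random.default_rng(0)
inputs = {
    "view_a": rng.normal(size=(20, 8)),
    "view_b": rng.normal(size=(20, 8)),
    "view_c": rng.normal(size=(20, 8)),
}
ratios = {"view_a": 0.75, "view_b": 0.50, "view_c": 0.25}  # assumed values

total = 0.0
for name, tokens in inputs.items():
    masked, mask = random_mask(tokens, ratios[name], rng)
    # A trained model would predict the hidden rows here; the masked input
    # stands in for its output so that the sketch runs end to end.
    total += masked_mse(masked, tokens, mask)
```

Restricting the loss to masked rows is the usual choice for masked autoencoders, since it rewards the model for inferring hidden content rather than copying visible content.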
