FORESEE: Multimodal and Multi-view Representation Learning for Robust Prediction of Cancer Survival

Liangrui Pan, Yijun Peng, Yan Li, Yiyi Liang, Liwen Xu, Qingchun Liang, Shaoliang Peng

TL;DR

FORESEE targets robust multimodal cancer survival prediction when data are incomplete. It introduces a Cross Fusion Transformer (CFT) to fuse cross-scale whole-slide image (WSI) features, a Hybrid Attention Encoder (HAE) to denoise and summarize molecular data, and an asymmetric Triplet Masked Autoencoder (TriMAE) to reconstruct missing intra-modal information. Trained end-to-end with modality-specific Cox losses and a TriMAE reconstruction objective, FORESEE reports state-of-the-art performance on four TCGA cancers, outperforming several baselines and showing strong Kaplan-Meier (KM) separation with significant p-values. Ablation studies indicate that multi-view pathology fusion, TriMAE (for missing data), and HAE each contribute to the observed gains.
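The summary mentions Cox losses without giving their exact form. As a rough illustration only, the sketch below computes a standard negative Cox partial log-likelihood for a single modality with NumPy. The function name, the averaging over observed events, and the simplified handling of tied times are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def cox_neg_partial_log_likelihood(risk, time, event):
    """Negative Cox partial log-likelihood, averaged over observed events.

    risk  : predicted log-risk score per patient (higher = worse prognosis)
    time  : observed survival or censoring time per patient
    event : 1 if death was observed, 0 if the patient was censored
    Assumption: tied times are not treated specially (no Efron correction).
    """
    risk = np.asarray(risk, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)

    # Sort by descending time so the risk set of patient i is rows 0..i.
    order = np.argsort(-time, kind="stable")
    risk, event = risk[order], event[order]

    # Running log-sum-exp over the risk set, computed in a numerically stable way.
    log_risk_set = np.logaddexp.accumulate(risk)

    n_events = int(event.sum())
    if n_events == 0:
        return 0.0  # no observed events, so the partial likelihood is empty
    return float(-(risk[event] - log_risk_set[event]).sum() / n_events)
```

In a multimodal setting, one such term could be evaluated per modality and summed with a reconstruction term. How FORESEE weights these terms is not stated in this summary.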

Abstract

Integrating the different data modalities of cancer patients can significantly improve the prediction of patient survival. However, most existing methods do not simultaneously exploit the rich semantic features available at different scales in pathology images. In addition, collecting multimodal data and extracting features can leave data missing within a modality, which introduces noise into the multimodal data. To address these challenges, this paper proposes a new end-to-end framework, FORESEE, for robustly predicting patient survival by mining multimodal information. First, the cross-fusion transformer (CFT) uses a cross-scale feature cross-fusion method to relate features at the cellular level, tissue level, and tumor heterogeneity level to prognosis, which strengthens the representation of pathological image features. Second, the hybrid attention encoder (HAE) uses a denoising contextual attention module to obtain contextual relationship features and local detail features of the molecular data, while its channel attention module obtains global features of the molecular data. Third, to address missing information within modalities, we propose an asymmetrically masked triplet masked autoencoder (TriMAE) to reconstruct the lost information. Extensive experiments demonstrate the superiority of our method over state-of-the-art methods on four benchmark datasets in both complete and missing settings.
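The abstract does not spell out how the masked autoencoder masks or scores its inputs. The sketch below shows only the generic masked-reconstruction idea: hide a random fraction of feature tokens, then measure reconstruction error on the hidden positions alone. The helper names, the uniform random masking, and the mean-squared-error objective are assumptions for illustration; the asymmetric, triplet design described in the paper is not reproduced here.

```python
import numpy as np

def random_mask(tokens, mask_ratio, rng):
    """Hide a random fraction of rows (feature tokens).

    Returns the visible tokens and a boolean mask (True = hidden).
    """
    n_tokens = tokens.shape[0]
    n_masked = int(round(n_tokens * mask_ratio))
    mask = np.zeros(n_tokens, dtype=bool)
    mask[rng.permutation(n_tokens)[:n_masked]] = True
    return tokens[~mask], mask

def masked_mse(reconstruction, target, mask):
    """Mean squared error computed only over the hidden tokens."""
    if not mask.any():
        return 0.0
    diff = reconstruction[mask] - target[mask]
    return float(np.mean(diff ** 2))

# Hypothetical usage: 10 tokens with 4 features each, 30% hidden.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(10, 4))
visible, mask = random_mask(tokens, mask_ratio=0.3, rng=rng)
# A real model would encode `visible` and decode all 10 tokens;
# here an all-zero reconstruction stands in for the decoder output.
loss = masked_mse(np.zeros_like(tokens), tokens, mask)
```

Scoring only the hidden positions keeps the objective focused on recovering missing information rather than copying the visible inputs.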

Paper Structure (14 sections, 8 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Obtaining graphical representations of multimodal data (Pathology, Molecular data, and EHR) from lung cancer patients.
  • Figure 2: Flowchart depicting multimodal patient survival prediction. The flowchart comprises the multi-view feature extraction, CFT, HAE, TriMAE, and multimodal survival prediction modules. Notably, GlobalP denotes the Global Pooling layer, and GELU serves as the activation function.
  • Figure 3: Flowchart for TriMAE to reconstruct missing data in multimodal data.
  • Figure 4: KM analysis compares the performance of state-of-the-art methods with FORESEE tested on four cancer datasets.
  • Figure 5: Ablation experiments to evaluate the C-index performance of TriMAE on four cancer datasets in the absence of information within the modality.
  • ...and 1 more figure
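The ablation figure reports C-index values. For readers unfamiliar with the metric, the sketch below is a minimal pure-Python version of Harrell's concordance index. The function name and the choice to count tied risk scores as half-concordant are conventions assumed for this example; the evaluation code used in the paper is not shown in this summary.

```python
def concordance_index(risk, time, event):
    """Harrell's C-index for right-censored survival data.

    A pair (i, j) is comparable when patient i had an observed event
    and a strictly shorter time than patient j. The pair is concordant
    when the model gives patient i the higher risk score.
    """
    concordant = 0.0
    comparable = 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs; C-index is undefined")
    return concordant / comparable
```

A value of 1.0 means every comparable pair is ranked correctly, and 0.5 corresponds to random ranking.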