Multi-modal Intermediate Feature Interaction AutoEncoder for Overall Survival Prediction of Esophageal Squamous Cell Cancer

Chengyu Wu, Yatao Zhang, Yaqi Wang, Qifeng Wang, Shuai Wang

TL;DR

The paper addresses ESCC survival prediction by leveraging multi-modal data (CT-derived features and clinical/tabular data) through a novel autoencoder framework, MIFI-AE. It introduces two modules, CMIFM for cross-modal feature interaction and MFFSM for multi-scale feature map fusion, plus MJ-Loss to align modalities and optimize survival prediction via a Cox partial likelihood objective. The approach achieves a C-index of $0.697 \pm 0.02$ on $1{,}354$ ESCC patients, outperforming several baselines, and demonstrates strong risk stratification with a highly significant log-rank result ($p = 5.8 \times 10^{-16}$). Ablation confirms the distinct contributions of the modules and the alignment loss, indicating improved handling of cross-modal semantic gaps and more robust prognosis guidance for clinical decision-making.
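The two quantities named above, the Cox partial likelihood objective and the C-index, can be illustrated with a minimal sketch. This is not the authors' MJ-Loss (its alignment term is not described here), and it is not their code: the function names `cox_neg_partial_log_likelihood` and `concordance_index` and the toy data are assumptions made for illustration. The sketch uses the standard negative Cox partial log-likelihood over risk sets and Harrell's pairwise definition of the C-index, in plain Python.

```python
import math


def cox_neg_partial_log_likelihood(risks, times, events):
    """Negative Cox partial log-likelihood, averaged over observed events.

    risks  : predicted log-risk scores (higher means worse prognosis)
    times  : follow-up times
    events : 1 if death was observed, 0 if the patient was censored
    """
    total, n_events = 0.0, 0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if not e_i:
            continue  # censored patients contribute only through risk sets
        # Risk set: every patient still under observation at time t_i.
        at_risk = [risks[j] for j, t_j in enumerate(times) if t_j >= t_i]
        # Numerically stable log-sum-exp over the risk set.
        m = max(at_risk)
        log_sum = m + math.log(sum(math.exp(r - m) for r in at_risk))
        total += risks[i] - log_sum
        n_events += 1
    return -total / n_events if n_events else 0.0


def concordance_index(risks, times, events):
    """Harrell's C-index: share of comparable pairs that are ranked correctly.

    A pair (i, j) is comparable when patient i had an observed event and
    a shorter follow-up time than patient j. Tied risk scores count as half.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")


# Hypothetical toy cohort: four patients, the third one censored.
toy_times = [1.0, 2.0, 3.0, 4.0]
toy_events = [1, 1, 0, 1]
toy_risks = [2.0, 1.0, 0.5, 0.1]  # higher risk for earlier deaths

toy_loss = cox_neg_partial_log_likelihood(toy_risks, toy_times, toy_events)
toy_cindex = concordance_index(toy_risks, toy_times, toy_events)
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported $0.697$ should be read. In a training setting the same loss would normally be written with a tensor library so that it is differentiable; the loop form here is only meant to make the risk-set definition explicit.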

Abstract

Survival prediction for esophageal squamous cell cancer (ESCC) is crucial for doctors to assess a patient's condition and tailor treatment plans. The application and development of multi-modal deep learning in this field have attracted attention in recent years. However, previous studies have not fully explored the prognostically relevant features shared across modalities, which could limit model performance. Furthermore, the inherent semantic gap between the feature representations of different modalities has been ignored. In this work, we propose a novel autoencoder-based deep learning model to predict the overall survival of ESCC patients. Two novel modules are designed to reinforce multi-modal prognosis-related features and to enhance modeling ability. In addition, a novel joint loss is proposed to better align the multi-modal feature representations. Comparison and ablation experiments demonstrate that our model achieves satisfactory discriminative ability and risk stratification, and that each proposed module is effective.

Paper Structure (17 sections, 4 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Pipeline of the proposed Multi-modal Intermediate Feature Interaction AutoEncoder (MIFI-AE). (a) Encoding part of the MIFI-AE. (b) Decoding part of the MIFI-AE.
  • Figure 2: Illustration of the proposed Multi-scale Feature map Fusion-Separation Module (MFFSM).
  • Figure 3: Illustration of the proposed Cross-modal Multi-step Intermediate Fusion Module (CMIFM).
  • Figure 4: Kaplan-Meier (KM) curves of overall survival (OS) for the compared and proposed models.