A Deep Learning Approach for Overall Survival Prediction in Lung Cancer with Missing Values

Camillo Maria Caruso, Valerio Guarrasi, Sara Ramella, Paolo Soda

TL;DR

This work tackles OS prediction in NSCLC under pervasive missing data by introducing a transformer-based model that masks missing features during both training and inference, thus removing the need for imputation. It employs a first-hitting-time–based loss together with a concordance-based ranking term to leverage both uncensored and censored patients and to capture time-varying risks. On the CLARO dataset, the approach outperforms state-of-the-art imputation-based methods and classical survival models across multiple time granularities, achieving higher time-dependent Ct-index scores. The method offers practical clinical utility by enabling accurate prognosis without imputing missing data, with efficient inference suitable for real-time decision support and potential extensions to multi-center datasets and imaging data.

Abstract

In lung cancer research, and particularly in the analysis of overall survival (OS), artificial intelligence (AI) serves specific aims. Because missing data are prevalent in the medical domain, our primary objective is to develop an AI model that handles missing data dynamically. We also aim to leverage all accessible data, analyzing both uncensored patients, who have experienced the event of interest, and censored patients, who have not, by embedding in our model a specialized technique not commonly used in other AI tasks. By meeting these objectives, our model aims to provide precise OS predictions for non-small cell lung cancer (NSCLC) patients. We present a novel approach to survival analysis with missing values in the context of NSCLC, which exploits the strengths of the transformer architecture to account only for available features without requiring any imputation strategy. More specifically, the model tailors the transformer architecture to tabular data by adapting its feature embedding and masked self-attention to mask missing data and fully exploit the available data. Using losses designed ad hoc for OS, it accounts for both censored and uncensored patients, as well as for changes in risk over time. We compared our method with state-of-the-art survival analysis models coupled with different imputation strategies. We evaluated the results over a period of 6 years using different time granularities, obtaining a Ct-index (a time-dependent variant of the C-index) of 71.97, 77.58 and 80.72 for time units of 1 month, 1 year and 2 years, respectively, outperforming all state-of-the-art methods regardless of the imputation method used.
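
The masking idea in the abstract can be illustrated with a minimal sketch of scaled dot-product attention in which missing features are excluded from the softmax. This is not the authors' implementation: the function name `masked_attention`, the use of plain Python lists, the absence of learned query/key/value projections, and the placeholder embeddings for missing features are all assumptions made for illustration.

```python
import math


def masked_attention(queries, keys, values, present):
    """Scaled dot-product attention that never attends to missing features.

    queries, keys, values: one float vector per feature token (hypothetical
        embeddings; a real model would obtain them from learned projections).
    present: one bool per feature token, True where the feature was observed.
    """
    if not any(present):
        raise ValueError("at least one feature must be observed")
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Missing features get a score of -inf, so their softmax weight is 0.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) if ok else -math.inf
            for k, ok in zip(keys, present)
        ]
        top = max(scores)  # finite, because at least one feature is observed
        weights = [math.exp(s - top) for s in scores]  # exp(-inf) == 0.0
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of value vectors; masked tokens contribute nothing.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Three feature tokens; the third feature is missing and holds a placeholder.
tokens = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
present = [True, True, False]
attended = masked_attention(tokens, tokens, tokens, present)
```

Because the masked token receives zero attention weight, whatever placeholder is stored for a missing feature has no influence on the representations of the observed features, which is why no imputed value is needed.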

Paper Structure (10 sections, 6 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Schematic representation of the proposed model: (A) architecture of the proposed approach and (B) example of positional encoding, mask and output vector, where the $-$ symbol represents a missing feature. For simplicity, only a few features are shown and no preprocessing is applied.
  • Figure 2: Prediction errors of the proposed method for different patient groups, measured against the actual event times. The graph reports the mean and standard error of the prediction for uncensored patients only; for censored patients no valid prediction error can be computed, since the event did not occur.
  • Figure 3: SHAP summary plots of features' contributions in the 3 models implemented with the 3 time units: Panel A) 1 month; Panel B) 1 year; Panel C) 2 years. The plots show the global feature importance by averaging the absolute SHAP values obtained for each patient and time represented in the output vector.
  • Figure 4: Ablation study of the two terms of the loss function proposed in DeepHit (bib:deephit): (A) Average performance (Ct-index) and (B) mean number of epochs to achieve convergence.
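
The two loss terms ablated in Figure 4 can be sketched for the single-risk case as follows. This follows the general DeepHit formulation (a log-likelihood term over the first hitting time plus a pairwise ranking term), not necessarily the exact variant used in the paper. The function names, the weights `alpha` and `sigma`, and the representation of each prediction as a list of per-time-bin probabilities are assumptions for illustration.

```python
import math


def cumulative_incidence(pmf, t):
    """F(t): predicted probability that the event occurs by time bin t (inclusive)."""
    return sum(pmf[: t + 1])


def log_likelihood_term(pmfs, times, events, eps=1e-12):
    """Negative log-likelihood using both uncensored and censored patients."""
    total = 0.0
    for pmf, t, event in zip(pmfs, times, events):
        if event:
            # Uncensored: the event happened exactly in bin t.
            total -= math.log(pmf[t] + eps)
        else:
            # Censored: the event had not happened by bin t.
            total -= math.log(1.0 - cumulative_incidence(pmf, t) + eps)
    return total


def ranking_term(pmfs, times, events, sigma=0.1):
    """Penalize pairs where a patient who died earlier is not ranked as riskier."""
    total = 0.0
    for i, (t_i, event_i) in enumerate(zip(times, events)):
        if not event_i:
            continue  # only uncensored patients can anchor a comparable pair
        f_i = cumulative_incidence(pmfs[i], t_i)
        for j, t_j in enumerate(times):
            if t_j > t_i:  # patient j was still event-free at time t_i
                f_j = cumulative_incidence(pmfs[j], t_i)
                total += math.exp(-(f_i - f_j) / sigma)
    return total


def deephit_style_loss(pmfs, times, events, alpha=1.0, sigma=0.1):
    """Sum of the two terms; alpha is a hypothetical weight on the ranking term."""
    return (log_likelihood_term(pmfs, times, events)
            + alpha * ranking_term(pmfs, times, events, sigma))


# Two patients, three time bins: patient 0 died in bin 0, patient 1 was censored in bin 1.
times, events = [0, 1], [True, False]
good = [[0.8, 0.1, 0.1], [0.05, 0.05, 0.9]]  # early risk assigned to patient 0
bad = [[0.05, 0.05, 0.9], [0.8, 0.1, 0.1]]   # early risk assigned to the wrong patient
loss_good = deephit_style_loss(good, times, events)
loss_bad = deephit_style_loss(bad, times, events)
```

In this toy example `loss_good` is smaller than `loss_bad`, since the first set of predictions both places probability mass on the observed event time and ranks the patient who died earlier as higher risk.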