A Deep Learning Approach for Overall Survival Prediction in Lung Cancer with Missing Values
Camillo Maria Caruso, Valerio Guarrasi, Sara Ramella, Paolo Soda
TL;DR
This work tackles OS prediction in NSCLC under pervasive missing data by introducing a transformer-based model that masks missing features during both training and inference, thus removing the need for imputation. It employs a first-hitting-time–based loss together with a concordance-based ranking term to leverage both uncensored and censored patients and to capture time-varying risks. On the CLARO dataset, the approach outperforms state-of-the-art imputation-based methods and classical survival models across multiple time granularities, achieving higher time-dependent Ct-index scores. The method offers practical clinical utility by enabling accurate prognosis without imputing missing data, with efficient inference suitable for real-time decision support and potential extensions to multi-center datasets and imaging data.
Abstract
In lung cancer research, and particularly in the analysis of overall survival (OS), artificial intelligence (AI) plays a crucial role. Because missing data are prevalent in the medical domain, our primary objective is to develop an AI model that handles missing values dynamically. We also aim to use all accessible data, learning from both uncensored patients, who have experienced the event of interest, and censored patients, who have not, by embedding in the model a specialized technique that is not commonly used in other AI tasks. By meeting these objectives, our model aims to provide precise OS predictions for non-small cell lung cancer (NSCLC) patients. We present a novel approach to survival analysis with missing values in NSCLC that exploits the strengths of the transformer architecture to account only for available features, without requiring any imputation strategy. More specifically, the model tailors the transformer architecture to tabular data by adapting its feature embedding and masked self-attention so that missing values are masked and the available features are fully exploited. Using losses designed ad hoc for OS, it accounts for both censored and uncensored patients, as well as for changes in risk over time. We compared our method with state-of-the-art models for survival analysis coupled with different imputation strategies. We evaluated the results over a period of 6 years at different time granularities, obtaining a Ct-index, a time-dependent variant of the C-index, of 71.97, 77.58 and 80.72 for time units of 1 month, 1 year and 2 years, respectively, and outperforming all state-of-the-art methods regardless of the imputation method used.
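To make the masking idea in the abstract concrete, the following is a minimal NumPy sketch of self-attention over per-feature tokens in which missing features are excluded as attention keys. It is an illustration under stated assumptions, not the authors' implementation: the function names (`embed_features`, `masked_self_attention`), the linear per-feature embedding, the single attention head, the omission of learned query/key/value projections, and the example patient are all hypothetical.

```python
import numpy as np


def embed_features(x, weight, bias):
    """Map each scalar feature x_j to a d-dimensional token x_j * w_j + b_j.

    Assumption: a simple linear per-feature embedding. Missing entries (NaN)
    get a zero placeholder value; they are masked out in the attention step.
    """
    x_filled = np.nan_to_num(x, nan=0.0)
    return x_filled[:, None] * weight + bias  # shape: (n_features, d)


def masked_self_attention(tokens, observed):
    """Single-head scaled dot-product self-attention with a key mask.

    Columns of the score matrix that correspond to missing features are set
    to -inf, so the softmax assigns them exactly zero weight. Learned
    query/key/value projections are omitted to keep the sketch short.
    """
    if not observed.any():
        raise ValueError("at least one feature must be observed")
    d = tokens.shape[1]
    scores = tokens @ tokens.T / np.sqrt(d)              # (n, n)
    scores = np.where(observed[None, :], scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ tokens, weights


# Hypothetical patient with four standardized features, two of them missing.
rng = np.random.default_rng(0)
x = np.array([0.5, np.nan, -1.2, np.nan])
observed = ~np.isnan(x)
weight = rng.normal(size=(4, 8))
bias = rng.normal(size=(4, 8))

tokens = embed_features(x, weight, bias)
out, attn = masked_self_attention(tokens, observed)
```

In this sketch the placeholder assigned to a missing feature never influences the representation of an observed feature, because its attention weight is exactly zero. That is the property that removes the need for an imputed value.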
