Attention-Based Synthetic Data Generation for Calibration-Enhanced Survival Analysis: A Case Study for Chronic Kidney Disease Using Electronic Health Records
Nicholas I-Hsien Kuo, Blanca Gallego, Louisa Jorm
TL;DR
This work tackles the data-access and calibration challenges in survival analysis for healthcare by introducing Masked Clinical Modelling (MCM), an attention-based framework that can generate high-fidelity synthetic CKD EHR data and support conditional augmentation for stratified calibration. MCM reconstructs masked clinical features using contextual dependencies, enabling standalone data synthesis as well as targeted augmentation without retraining. In a CKD dataset, MCM improves general calibration and stabilizes calibration across 10 clinically defined subgroups, outperforming many baselines and approaching the performance of specialized survival-synthesis methods. The approach enhances model reliability and equitable representation in settings with restricted data sharing, offering practical value for precision medicine and efficient resource planning. Future work includes integrating privacy-preserving mechanisms and validating MCM on larger, more diverse cohorts.
Abstract
Access to real-world healthcare data is limited by stringent privacy regulations and data imbalances, hindering advancements in research and clinical applications. Synthetic data presents a promising solution, yet existing methods often fail to ensure the realism, utility, and calibration essential for robust survival analysis. Here, we introduce Masked Clinical Modelling (MCM), an attention-based framework capable of generating high-fidelity synthetic datasets that preserve critical clinical insights, such as hazard ratios, while enhancing survival model calibration. Unlike traditional statistical methods such as SMOTE and machine learning models such as VAEs, MCM supports both standalone dataset synthesis for reproducibility and conditional simulation for targeted augmentation, addressing diverse research needs. Validated on a chronic kidney disease electronic health records dataset, MCM reduced the general calibration loss over the entire dataset by 15% and the mean calibration loss across 10 clinically stratified subgroups by 9%, outperforming 15 alternative methods. By bridging data accessibility with translational utility, MCM advances the precision of healthcare models, promoting more efficient use of scarce healthcare resources.
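To make the masking idea concrete, the following is a minimal sketch of how a patient record might be prepared for masked reconstruction and for conditional simulation. It is not the authors' implementation: the feature names, the `MASK` sentinel, the `mask_ratio` default, and both helper functions are hypothetical, and the attention model that would fill in the hidden values is omitted entirely.

```python
import random

MASK = None  # hypothetical sentinel marking a hidden feature value


def mask_record(record, mask_ratio=0.3, rng=None):
    """Hide a random subset of features.

    Returns (masked_record, targets), where targets holds the true values
    that a reconstruction model would be trained to recover.
    """
    rng = rng or random.Random()
    keys = list(record)
    n_hide = max(1, round(mask_ratio * len(keys)))
    hidden = set(rng.sample(keys, n_hide))
    masked = {k: (MASK if k in hidden else v) for k, v in record.items()}
    targets = {k: record[k] for k in keys if k in hidden}
    return masked, targets


def mask_for_conditioning(record, keep):
    """Keep the chosen features visible and hide all others.

    A trained model could then fill in the hidden features conditioned
    on the visible ones (for example, to augment a specific subgroup).
    """
    return {k: (v if k in keep else MASK) for k, v in record.items()}


# Hypothetical CKD-style record; the field names are illustrative only.
patient = {
    "age": 67,
    "egfr": 41.0,
    "diabetes": 1,
    "hypertension": 1,
    "time_to_event": 5.2,
}

masked, targets = mask_record(patient, mask_ratio=0.4, rng=random.Random(0))
conditioned = mask_for_conditioning(patient, keep={"age", "diabetes"})
```

In this sketch the same masking mechanism serves both uses mentioned above: random masking yields training pairs for reconstruction, while masking everything except a chosen set of features sets up conditional generation without any change to the model.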
