Attention-Based Synthetic Data Generation for Calibration-Enhanced Survival Analysis: A Case Study for Chronic Kidney Disease Using Electronic Health Records

Nicholas I-Hsien Kuo, Blanca Gallego, Louisa Jorm

TL;DR

This work tackles the data-access and calibration challenges in survival analysis for healthcare by introducing Masked Clinical Modelling (MCM), an attention-based framework that can generate high-fidelity synthetic CKD EHR data and support conditional augmentation for stratified calibration. MCM reconstructs masked clinical features using contextual dependencies, enabling standalone data synthesis as well as targeted augmentation without retraining. In a CKD dataset, MCM improves general calibration and stabilizes calibration across 10 clinically defined subgroups, outperforming many baselines and approaching the performance of specialized survival-synthesis methods. The approach enhances model reliability and equitable representation in settings with restricted data sharing, offering practical value for precision medicine and efficient resource planning. Future work includes integrating privacy-preserving mechanisms and validating MCM on larger, more diverse cohorts.

Abstract

Access to real-world healthcare data is limited by stringent privacy regulations and data imbalances, hindering advances in research and clinical applications. Synthetic data presents a promising solution, yet existing methods often fail to ensure the realism, utility, and calibration essential for robust survival analysis. Here, we introduce Masked Clinical Modelling (MCM), an attention-based framework capable of generating high-fidelity synthetic datasets that preserve critical clinical insights, such as hazard ratios, while enhancing survival model calibration. Unlike traditional statistical methods such as SMOTE and machine learning models such as VAEs, MCM supports both standalone dataset synthesis for reproducibility and conditional simulation for targeted augmentation, addressing diverse research needs. Validated on a chronic kidney disease (CKD) electronic health records (EHR) dataset, MCM reduced the general calibration loss over the entire dataset by 15% and the mean calibration loss across 10 clinically stratified subgroups by 9%, outperforming 15 alternative methods. By bridging data accessibility with translational utility, MCM advances the precision of healthcare models, promoting more efficient use of scarce healthcare resources.

Paper Structure

This paper contains 45 sections, 19 equations, 19 figures, 18 tables, and 7 algorithms.

Figures (19)

  • Figure 1: An overview of our masked clinical modelling framework. Subfigure (a) illustrates the principles of masked language modelling, where random words in a sentence are masked and predicted. Subfigure (b) adapts this to MCM, masking random features in clinical data and reconstructing them. Subfigure (c) outlines the engineering pipeline: (1) scaling the raw clinical data to a standardised range; (2) applying random masking; (3) predicting missing values using the remaining features as context; and (4) rescaling features to their original ranges. Subfigure (d) presents the internal model architecture, comprising attention layers, residual connections, and linear layers, which enable the framework to capture dependencies between clinical covariates.
  • Figure 2: Comparison of real variables (gold) and synthetic counterparts (grey) from the CKD EHR dataset. Binary variables are visualised with histograms, while numeric variables are compared using kernel density estimations (KDEs).
  • Figure 3: Comparison of correlations among selected variables from the CKD EHR dataset. The top panel shows real data, and the bottom panel displays synthetic data. Blue represents negative correlations, and red represents positive correlations.
  • Figure 4: Complete Correlation.
  • Figure 5: Comparison of Kaplan-Meier (KM) curves between real and synthetic data.
  • ...and 14 more figures
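
The four pipeline steps named in the Figure 1 caption (scale, mask, predict, rescale) can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the authors' implementation: the attention-based predictor is replaced by a nearest-neighbour stand-in, and the function names, the toy feature columns, and the `mask_rate` parameter are all hypothetical.

```python
import random

def fit_minmax(rows):
    """Record per-feature (min, max) so the scaling can be inverted later."""
    return [(min(col), max(col)) for col in zip(*rows)]

def scale(rows, bounds):
    """Step 1: map each feature to [0, 1]; constant features map to 0."""
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, (lo, hi) in zip(row, bounds)] for row in rows]

def rescale(rows, bounds):
    """Step 4: invert the min-max scaling back to the original ranges."""
    return [[v * (hi - lo) + lo for v, (lo, hi) in zip(row, bounds)]
            for row in rows]

def predict_masked(row, mask, reference):
    """Step 3 (stand-in, NOT the attention model): fill masked features from
    the reference row that is closest on the visible (unmasked) features."""
    visible = [j for j, hidden in enumerate(mask) if not hidden]
    nearest = min(reference,
                  key=lambda r: sum((r[j] - row[j]) ** 2 for j in visible))
    return [nearest[j] if hidden else row[j] for j, hidden in enumerate(mask)]

def reconstruct(rows, mask_rate=0.3, seed=0):
    """Run scale -> mask -> predict -> rescale over every row."""
    rng = random.Random(seed)
    bounds = fit_minmax(rows)
    scaled = scale(rows, bounds)
    out = []
    for i, row in enumerate(scaled):
        # Step 2: hide each feature independently with probability mask_rate.
        mask = [rng.random() < mask_rate for _ in row]
        # Exclude the row itself so the stand-in cannot simply copy it back.
        others = scaled[:i] + scaled[i + 1:]
        out.append(predict_masked(row, mask, others))
    return rescale(out, bounds)

# Hypothetical toy records: [age, eGFR, diabetes flag].
toy = [
    [62.0, 45.0, 1.0],
    [71.0, 30.0, 1.0],
    [55.0, 58.0, 0.0],
    [68.0, 38.0, 0.0],
]
synthetic = reconstruct(toy, mask_rate=0.5, seed=1)
```

With `mask_rate=0.0` nothing is hidden and the pipeline reduces to a scale-then-rescale round trip, which is a convenient sanity check before swapping in a learned predictor.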