CK4Gen: A Knowledge Distillation Framework for Generating High-Utility Synthetic Survival Datasets in Healthcare

Nicholas I-Hsien Kuo, Blanca Gallego, Louisa Jorm

TL;DR

CK4Gen (Cox Knowledge for Generation) is a novel framework that leverages knowledge distillation from Cox Proportional Hazards (CoxPH) models to create synthetic survival datasets that preserve key clinical characteristics, including hazard ratios and survival curves.

Abstract

Access to real clinical data is heavily restricted by privacy regulations, hindering both healthcare research and education. These constraints slow progress in developing new treatments and data-driven healthcare solutions, while also limiting students' access to real-world datasets, leaving them without essential practical skills. High-utility synthetic datasets are therefore critical for advancing research and providing meaningful training material. However, current generative models -- such as Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) -- produce surface-level realism at the expense of healthcare utility, blending distinct patient profiles and producing synthetic data of limited practical relevance. To overcome these limitations, we introduce CK4Gen (Cox Knowledge for Generation), a novel framework that leverages knowledge distillation from Cox Proportional Hazards (CoxPH) models to create synthetic survival datasets that preserve key clinical characteristics, including hazard ratios and survival curves. CK4Gen avoids the interpolation issues seen in VAEs and GANs by maintaining distinct patient risk profiles, ensuring realistic and reliable outputs for research and educational use. Validated across four benchmark datasets -- GBSG2, ACTG320, WHAS500, and FLChain -- CK4Gen outperforms competing techniques by better aligning real and synthetic data, enhancing survival model performance in both discrimination and calibration via data augmentation. As CK4Gen is scalable across clinical conditions, and with code to be made publicly available, future researchers can apply it to their own datasets to generate synthetic versions suitable for open sharing.

Paper Structure

This paper contains 65 sections, 26 equations, 25 figures, 28 tables, 3 algorithms.

Figures (25)

  • Figure 1: Overview of the CK4Gen framework for generating synthetic datasets. Panels (a), (b), and (c) illustrate the main pipeline. Panel (a) shows the process starting with sampling from the real dataset. Panel (b) depicts the CK4Gen framework, where the DCM encoder and SynthNet decoder reconstruct the real data into synthetic data. Panel (c) highlights the postprocessing of reconstructed data to restore the original scale and format, with Event and Duration directly copied from the original dataset and combined with the synthetic data to form the final output. Panels (d), (e), and (f) provide supplementary details. Panel (d) illustrates the preprocessing of raw patient data. Panel (e) shows the training of the DCM encoder using knowledge distillation from a pre-trained CoxPH model, with raw data as input. Panel (f) depicts the SynthNet decoder, which receives latent representations from the DCM encoder and is trained to reconstruct the preprocessed synthetic data. This figure is best viewed in colour. Red: Appears in panel (a). Represents the raw, original data from the real dataset. Grey: Appears in panels (a) and (c). Represents Event and Duration from the original data. Yellow: Appears in panels (b) and (e). Represents the DCM encoder and its representations. Purple: Appears in panels (b) and (f). Represents the SynthNet decoder. Cyan: Appears in panel (c). Represents the raw reconstructed data generated by SynthNet. Blue: Appears in panel (c). Represents the postprocessed reconstructed data. Orange: Appears in panel (d). Represents the preprocessed ground truth data.
  • Figure 2: Side-by-side histograms comparing the distributions of binary clinical variables between the real (gold) and synthetic (purple) datasets for the GBSG2 study. The figure reveals that CK4Gen is capable of synthesising datasets with both balanced and heavily imbalanced variable distributions.
  • Figure 3: A side-by-side comparison of correlation matrices for the GBSG2 dataset, with real data on the top and synthetic data on the bottom. Blue indicates negative correlations, while red indicates positive correlations. Although some individual correlations show slight variations, CK4Gen generates data that closely mirror the inter-variable relationships of the real data.
  • Figure 4: Side-by-side histograms and KDEs comparing the distributions of clinical variables between the real (gold) and synthetic (green) datasets for the ACTG320 study. The binary variables are compared using histograms, while numeric variables are overlaid using KDEs to assess the similarity between distributions. The figure highlights that CK4Gen can generate both balanced and imbalanced binary variables, as well as numeric variables with long-tailed distributions.
  • Figure 5: A side-by-side comparison of correlation matrices for the ACTG320 dataset, with real data on the left and synthetic data on the right. CK4Gen demonstrates the ability to generate datasets with highly realistic correlations among numeric variables, binary variables, and between numeric and binary variables.
  • ...and 20 more figures
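
The postprocessing step described for panel (c) of Figure 1 (restoring scale and format, then attaching Event and Duration copied from the real data) can be illustrated with a minimal sketch. It assumes z-score scaling for numeric variables and a 0.5 threshold for binary variables, neither of which is stated in the captions above. The names `fit_scaler`, `postprocess`, `age`, and `treated` are hypothetical and are not taken from the CK4Gen code.

```python
from statistics import mean, pstdev


def fit_scaler(column):
    """Return (mean, std) of a real numeric column; std falls back to 1.0 if constant."""
    mu = mean(column)
    sigma = pstdev(column) or 1.0
    return mu, sigma


def postprocess(recon_rows, scalers, binary_cols, events, durations):
    """Restore reconstructed covariates to their original scale and format,
    then attach Event and Duration copied unchanged from the real data."""
    output = []
    for row, event, duration in zip(recon_rows, events, durations):
        restored = {}
        for name, value in row.items():
            if name in binary_cols:
                # Assumption: binary variables are recovered by thresholding at 0.5.
                restored[name] = int(value >= 0.5)
            else:
                # Assumption: numeric variables were z-scored during preprocessing.
                mu, sigma = scalers[name]
                restored[name] = value * sigma + mu
        restored["event"] = event
        restored["duration"] = duration
        output.append(restored)
    return output


# Toy example: three patients, one numeric and one binary covariate.
real_age = [40.0, 50.0, 60.0]
scalers = {"age": fit_scaler(real_age)}
reconstructed = [
    {"age": -1.0, "treated": 0.9},
    {"age": 0.0, "treated": 0.2},
    {"age": 1.0, "treated": 0.6},
]
synthetic = postprocess(
    reconstructed,
    scalers,
    binary_cols={"treated"},
    events=[1, 0, 1],
    durations=[12.0, 30.5, 7.0],
)
```

In this sketch each synthetic row keeps the real patient's outcome pair while its covariates come from the decoder output, mirroring the caption's description of how the final dataset is assembled.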