SurvDiff: A Diffusion Model for Generating Synthetic Data in Survival Analysis

Marie Brockschmidt, Maresa Schröder, Stefan Feuerriegel

TL;DR

SurvDiff addresses the challenge of generating faithful synthetic survival data under right-censoring by learning an end-to-end diffusion model that jointly produces mixed-type covariates, event times, and censoring indicators. It introduces a survival-tailored diffusion loss that combines reconstruction with a Cox-like partial-likelihood objective and uses adaptive weighting to handle sparse, long-tail events. Across AIDS, GBSG2, and METABRIC, SurvDiff delivers superior covariate fidelity, event-time dynamics, and downstream survival performance compared with state-of-the-art baselines, including under small-sample and privacy-preserving settings. This work enables reliable synthetic survival datasets for training and evaluating downstream models while preserving clinically meaningful covariate and time-to-event structure.
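To make the "Cox-like partial-likelihood objective" concrete, the following is a minimal sketch of the standard negative Cox partial log-likelihood in NumPy. It is not SurvDiff's actual loss, which additionally includes a reconstruction term and adaptive weighting that are not specified here. The function name `cox_neg_partial_log_likelihood` and its arguments are hypothetical, and the sketch assumes there are no tied event times.

```python
import numpy as np

def cox_neg_partial_log_likelihood(risk, time, event):
    """Standard negative Cox partial log-likelihood, averaged over events.

    Illustrative only: assumes no tied times and is not the paper's loss.
    risk  : predicted log-risk score per sample (higher = earlier event)
    time  : observed time per sample
    event : 1 if the event was observed, 0 if right-censored
    """
    risk = np.asarray(risk, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)

    # Sort by descending time so that, for sample i, the risk set
    # (everyone still under observation at time_i) is samples 0..i.
    order = np.argsort(-time, kind="stable")
    risk, event = risk[order], event[order]

    # Numerically stable running log-sum-exp of the scores over each risk set.
    log_risk_set = np.logaddexp.accumulate(risk)

    # Only observed events contribute terms; censored samples
    # still appear in the risk sets of earlier events.
    n_events = max(int(event.sum()), 1)
    return -np.sum((risk - log_risk_set)[event]) / n_events
```

For instance, `cox_neg_partial_log_likelihood([2.0, 1.0, 0.0], [1.0, 2.0, 3.0], [1, 1, 1])` yields a lower value than the same call with the scores reversed, because higher scores are assigned to the patients whose events occur earlier.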

Abstract

Survival analysis is a cornerstone of clinical research, modeling time-to-event outcomes such as metastasis, disease relapse, or patient death. Unlike standard tabular data, survival data often come with incomplete event information due to dropout or loss to follow-up. This poses unique challenges for synthetic data generation, where it is crucial for clinical research to faithfully reproduce both the event-time distribution and the censoring mechanism. In this paper, we propose SurvDiff, an end-to-end diffusion model specifically designed for generating synthetic data in survival analysis. SurvDiff is tailored to capture the data-generating mechanism by jointly generating mixed-type covariates, event times, and right-censoring, guided by a survival-tailored loss function. The loss encodes the time-to-event structure and directly optimizes for downstream survival tasks, which ensures that SurvDiff (i) reproduces realistic event-time distributions and (ii) preserves the censoring mechanism. Across multiple medical datasets, we show that SurvDiff consistently outperforms state-of-the-art generative baselines in both distributional fidelity and survival model evaluation metrics. To the best of our knowledge, SurvDiff is the first end-to-end diffusion model explicitly designed for generating synthetic survival data.
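As a small illustration of what right-censoring means for the data a generator has to reproduce, the sketch below builds the observed pair (time, event indicator) from a latent event time and a latent censoring time. This is the textbook construction, not code from the paper. The function name `apply_right_censoring` and its arguments are hypothetical.

```python
import numpy as np

def apply_right_censoring(event_time, censor_time):
    """Textbook right-censoring: observe whichever time comes first.

    Returns the observed time t = min(T, C) and the event indicator
    e = 1 if the event happened no later than censoring, else 0.
    """
    event_time = np.asarray(event_time, dtype=float)
    censor_time = np.asarray(censor_time, dtype=float)

    observed_time = np.minimum(event_time, censor_time)
    event = (event_time <= censor_time).astype(int)
    return observed_time, event

# Hypothetical patients: the second one drops out at time 3,
# before the event at time 8 could be observed.
t, e = apply_right_censoring([5.0, 8.0, 2.0], [6.0, 3.0, 2.0])
```

A censored record therefore only says that the event had not occurred by the recorded time, which is why the event-time distribution and the censoring mechanism need to be modeled together.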

Paper Structure

This paper contains 26 sections, 17 equations, 15 figures, 14 tables.

Figures (15)

  • Figure 1: SurvDiff for generating synthetic survival data. Our SurvDiff generates synthetic samples that retain the structure of the original data, including high-fidelity covariate distributions and faithful event-time distributions while preserving the censoring mechanism. The synthetic dataset can then be used to train downstream survival models without direct access to the original patient-level data.
  • Figure 2: Overview of SurvDiff. SurvDiff consists of the forward diffusion, the backward diffusion, and the novel survival-focused loss. Importantly, we distinguish the roles of $e$ (event indicator; binary) and $t$ (time-to-event; continuous), which follow different noising schemes because of their different variable types.
  • Figure 3: t-SNE visualization of covariate fidelity of real and synthetic data on GBSG2. $\Rightarrow$ Takeaway: Synthetic samples from SurvDiff are well aligned with the original data. SurvDiff achieves high covariate fidelity.
  • Figure 4: Temporal distributions of real and synthetic survival data on METABRIC, shown separately for censored and uncensored patients. $\Rightarrow$ Takeaway: Synthetic patients from SurvDiff exhibit event-time patterns similar to those of the real cohort, indicating strong temporal fidelity.
  • Figure 5: t-SNE visualization of covariate fidelity on the AIDS dataset.
  • ...and 10 more figures