SurvDiff: A Diffusion Model for Generating Synthetic Data in Survival Analysis
Marie Brockschmidt, Maresa Schröder, Stefan Feuerriegel
TL;DR
SurvDiff addresses the challenge of generating faithful synthetic survival data under right-censoring by learning an end-to-end diffusion model that jointly produces mixed-type covariates, event times, and censoring indicators. It introduces a survival-tailored diffusion loss that combines reconstruction with a Cox-like partial-likelihood objective and uses adaptive weighting to handle sparse, long-tail events. Across AIDS, GBSG2, and METABRIC, SurvDiff delivers superior covariate fidelity, event-time dynamics, and downstream survival performance compared with state-of-the-art baselines, including under small-sample and privacy-preserving settings. This work enables reliable synthetic survival datasets for training and evaluating downstream models while preserving clinically meaningful covariate and time-to-event structure.
Abstract
Survival analysis is a cornerstone of clinical research, modeling time-to-event outcomes such as metastasis, disease relapse, or patient death. Unlike standard tabular data, survival data often come with incomplete event information due to dropout or loss to follow-up. This poses unique challenges for synthetic data generation, where it is crucial for clinical research to faithfully reproduce both the event-time distribution and the censoring mechanism. In this paper, we propose SurvDiff, an end-to-end diffusion model specifically designed for generating synthetic data in survival analysis. SurvDiff is tailored to capture the data-generating mechanism by jointly generating mixed-type covariates, event times, and right-censoring, guided by a survival-tailored loss function. The loss encodes the time-to-event structure and directly optimizes for downstream survival tasks, which ensures that SurvDiff (i) reproduces realistic event-time distributions and (ii) preserves the censoring mechanism. Across multiple medical datasets, we show that SurvDiff consistently outperforms state-of-the-art generative baselines in both distributional fidelity and survival model evaluation metrics. To the best of our knowledge, SurvDiff is the first end-to-end diffusion model explicitly designed for generating synthetic survival data.
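
To make the Cox-like partial-likelihood term from the TL;DR concrete, the following is a minimal sketch of the textbook Cox negative log partial likelihood on right-censored data. It is not SurvDiff's actual loss: the summary above does not specify how this term is combined with the reconstruction term or how the adaptive weighting works. The function name, the averaging over observed events, and the toy data are assumptions made for illustration only.

```python
import math


def cox_neg_log_partial_likelihood(risk_scores, times, events):
    """Textbook Cox negative log partial likelihood (Breslow handling of ties).

    Hypothetical helper for illustration, not the paper's implementation.
    risk_scores: predicted log-risk per subject (higher = earlier event expected)
    times:       observed time per subject, i.e. min(event time, censoring time)
    events:      1 if the event was observed, 0 if the subject was right-censored
    """
    n_events = sum(1 for e in events if e)
    if n_events == 0:
        return 0.0  # no observed events: the partial likelihood is empty

    total = 0.0
    for score_i, t_i, e_i in zip(risk_scores, times, events):
        if not e_i:
            # Censored subjects add no term of their own; they only
            # appear in the risk sets of subjects with earlier events.
            continue
        # Risk set: everyone still under observation at time t_i.
        risk_set = [s for s, t in zip(risk_scores, times) if t >= t_i]
        # Numerically stable log-sum-exp over the risk set.
        m = max(risk_set)
        log_sum = m + math.log(sum(math.exp(s - m) for s in risk_set))
        total += score_i - log_sum

    # Averaging over observed events is a normalization choice (assumption).
    return -total / n_events


# Toy data: three subjects, the second one is right-censored at t = 2.
times = [1.0, 2.0, 3.0]
events = [1, 0, 1]
well_ranked = cox_neg_log_partial_likelihood([2.0, 1.0, 0.0], times, events)
badly_ranked = cox_neg_log_partial_likelihood([0.0, 1.0, 2.0], times, events)
print(well_ranked < badly_ranked)
```

The loss is lower when subjects with earlier events receive higher risk scores, which is the ranking structure a Cox-style objective rewards. Censored subjects still influence the loss through the risk sets, which is how right-censoring enters the objective without discarding those records.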
