TrialSynth: Generation of Synthetic Sequential Clinical Trial Data

Chufan Gao; Mandis Beigi; Afrah Shafquat; Jacob Aptekar; Jimeng Sun

TL;DR

TrialSynth integrates a Variational Autoencoder with Hawkes Process-based sequential modeling to generate high-fidelity, time-stamped synthetic clinical trial data from small datasets. By employing a Transformer encoder to produce latent representations and a Hawkes-based decoder, it captures both the timing and order of clinical events, with optional knowledge of event types to improve fidelity. Across seven real-world datasets, TrialSynth variants outperform baselines in downstream predictive utility while offering tunable privacy-utility trade-offs via VAE sampling variance and event-type control. The approach supports targeted trial design applications, enabling realistic synthetic trajectories that protect patient privacy, though generalizability to larger, more diverse populations and deeper privacy guarantees remain areas for future work.

Abstract

Analyzing data from past clinical trials is part of the ongoing effort to optimize the design, implementation, and execution of new clinical trials and more efficiently bring life-saving interventions to market. While there have been recent advances in the generation of static context synthetic clinical trial data, due to both limited patient availability and constraints imposed by patient privacy needs, the generation of fine-grained synthetic time-sequential clinical trial data has been challenging. Given that patient trajectories over an entire clinical trial are of high importance for optimizing trial design and efforts to prevent harmful adverse events, there is a significant need for the generation of high-fidelity time-sequence clinical trial data. Here we introduce TrialSynth, a Variational Autoencoder (VAE) designed to address the specific challenges of generating synthetic time-sequence clinical trial data. Distinct from related clinical data VAE methods, the core of our method leverages Hawkes Processes (HP), which are particularly well-suited for modeling event-type and time gap prediction needed to capture the structure of sequential clinical trial data. Our experiments demonstrate that TrialSynth surpasses the performance of other comparable methods that can generate sequential clinical trial data at varying levels of fidelity / privacy tradeoff, enabling the generation of highly accurate event sequences across multiple real-world sequential event datasets with small patient source populations. Notably, our empirical findings highlight that TrialSynth not only outperforms existing clinical sequence-generating methods but also produces data with superior utility while empirically preserving patient privacy.
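
For readers unfamiliar with Hawkes Processes, the sketch below illustrates the self-exciting idea that makes them suited to time-gap modeling: each past event temporarily raises the rate of future events. It uses the classical univariate exponential-kernel form, which is an assumption made for illustration and not TrialSynth's actual parameterization. The function name and the values of `mu`, `alpha`, and `beta` are hypothetical.

```python
import math


def hawkes_intensity(t, history, mu=0.2, alpha=0.8, beta=1.0):
    """Classical exponential-kernel Hawkes intensity at time t.

    lambda(t) = mu + alpha * sum over past events t_i < t of exp(-beta * (t - t_i))

    mu is the baseline event rate, alpha the jump added by each past event,
    and beta the rate at which that excitation decays (all illustrative values).
    """
    excitation = sum(math.exp(-beta * (t - t_i)) for t_i in history if t_i < t)
    return mu + alpha * excitation


# Toy event times (e.g. days since enrollment); purely illustrative.
event_times = [1.0, 1.5, 4.0]

baseline = hawkes_intensity(0.5, event_times)     # before any event: equals mu
after_burst = hawkes_intensity(1.6, event_times)  # shortly after two events
long_after = hawkes_intensity(3.9, event_times)   # excitation has mostly decayed

print(baseline, after_burst, long_after)
```

In this toy setting the intensity is highest right after the two clustered events and decays back toward the baseline, which is the behavior that lets a Hawkes-style model place the next event close to a recent burst or far from it.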

Paper Structure (24 sections, 14 equations, 9 figures, 5 tables)

Introduction
Related Work
TrialSynth
Encoding and Decoding Hawkes Processes
Final Loss Terms
Event Type Information
Sequence Length Prediction
Experiments
Datasets
Baseline Methods
Utility Evaluation
Privacy evaluations
Utility / Privacy Trade-off
Discussion
Appendix
...and 9 more sections

Figures (9)

Figure 1: Visualization of the data input and synthetic data generation of TrialSynth. The model input is the real patient events and their timestamps, and the goal is to generate synthetic patient events and their timestamps. This is a particularly challenging task due to the small amount of patient data. TrialSynth also explicitly supports adding event type information by specifying the event types to generate.
Figure 2: Diagram of the TrialSynth Encoder-Decoder structure. Here, the model input is the real patient event sequence and its timestamps, and the VAE is trained to reconstruct the same timestamps and event sequence as output. The event sequence length for each event is also predicted. The Transformer encoder processes each input timestep, then the output embeddings are individually transformed to the z-latent space via a neural network. Sampling and decoding occur from each timestep-specific z-latent representation.
Figure 3: Example of a sequence generated by TrialSynth for NCT00003299, plotted by individual event. Red dots and lines denote the ground truth event occurrences and the time between events, respectively. In this case, time is measured in days. The blue dots and lines are the predicted events. Numerical events such as wbc (white blood cell count) are discretized based on their unique values in the real data. This will be corrected in the new version. Each prediction is linked with dashed lines for clarity.
Figure 4: Two privacy-utility trade-off examples in TrialSynth: distance to closest record (DCR) (red) and downstream ROC (blue) metrics at varying levels of VAE sampling variance (from 0.1 to 4), represented as the "Var Multiplier."
Figure 5: Example of another sequence generated by TrialSynth (Events Known) for NCT00003299. The blue dots denote the specific event timestamp predictions. The red dots are the ground truth timestamps and the ground truth predictions. Each prediction is also linked with dashed lines for clarity.
...and 4 more figures
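
The privacy-utility trade-off in Figure 4 relies on two ingredients: a multiplier on the VAE sampling variance and the distance to closest record (DCR) metric. The following is a minimal sketch of both, under loudly stated assumptions: the multiplier is assumed to scale the variance (so the standard deviation scales by its square root), DCR is assumed to use Euclidean distance, and the toy setup skips the decoder by treating each latent mean as the record itself. The function names are hypothetical and do not come from the TrialSynth code.

```python
import math
import random


def sample_latent(mu, sigma, var_multiplier, eps):
    """Reparameterized draw z = mu + sqrt(k) * sigma * eps, elementwise.

    Assumption: the multiplier k scales the variance, so the standard
    deviation is scaled by sqrt(k). eps holds standard normal draws.
    """
    scale = math.sqrt(var_multiplier)
    return [m + scale * s * e for m, s, e in zip(mu, sigma, eps)]


def distance_to_closest_record(synthetic, real):
    """For each synthetic record, the Euclidean distance to its nearest real record."""
    return [min(math.dist(s, r) for r in real) for s in synthetic]


rng = random.Random(0)
real = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]  # toy "real" records
sigma = [1.0, 1.0]                             # illustrative posterior std

for k in (0.1, 1.0, 4.0):
    # Toy simplification: each real record doubles as its own latent mean.
    synthetic = [
        sample_latent(r, sigma, k, [rng.gauss(0, 1), rng.gauss(0, 1)])
        for r in real
    ]
    dcr = distance_to_closest_record(synthetic, real)
    print(f"var multiplier {k}: mean DCR = {sum(dcr) / len(dcr):.3f}")
```

A multiplier of zero would reproduce the means exactly and give a DCR of zero, while larger multipliers spread samples further from their source records. How that spread affects downstream utility is what the blue curves in Figure 4 report.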
