Real-valued (Medical) Time Series Generation with Recurrent Conditional GANs

Cristóbal Esteban, Stephanie L. Hyland, Gunnar Rätsch

TL;DR

The paper addresses restricted access to privacy-sensitive medical data by introducing Recurrent GANs (RGAN) and Recurrent Conditional GANs (RCGAN) that generate realistic real-valued time series. It measures distributional fidelity with maximum mean discrepancy (MMD) and proposes two measures of downstream usefulness: Train on Synthetic, Test on Real (TSTR) and Train on Real, Test on Synthetic (TRTS). In experiments on toy sequences, MNIST serialized as a time series, and ICU data from the eICU database, models trained on synthetic data perform close to models trained on real data. The authors also analyze privacy risks and the potential of differential privacy. The work has practical implications for privacy-preserving data sharing and for synthetic benchmarks in medicine, where realism must be balanced against privacy safeguards.

Abstract

Generative Adversarial Networks (GANs) have shown remarkable success as a framework for training models to produce realistic-looking data. In this work, we propose a Recurrent GAN (RGAN) and Recurrent Conditional GAN (RCGAN) to produce realistic real-valued multi-dimensional time series, with an emphasis on their application to medical data. RGANs make use of recurrent neural networks in the generator and the discriminator. In the case of RCGANs, both of these RNNs are conditioned on auxiliary information. We demonstrate our models in a set of toy datasets, where we show visually and quantitatively (using sample likelihood and maximum mean discrepancy) that they can successfully generate realistic time-series. We also describe novel evaluation methods for GANs, where we generate a synthetic labelled training dataset, and evaluate on a real test set the performance of a model trained on the synthetic data, and vice-versa. We illustrate with these metrics that RCGANs can generate time-series data useful for supervised training, with only minor degradation in performance on real test data. This is demonstrated on digit classification from 'serialised' MNIST and by training an early warning system on a medical dataset of 17,000 patients from an intensive care unit. We further discuss and analyse the privacy concerns that may arise when using RCGANs to generate realistic synthetic medical time series data.
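
The two evaluation protocols described in the abstract (train on synthetic data and test on real data, and vice versa) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the nearest-centroid classifier, the helper names (`fit_centroids`, `train_test_score`, `make_toy`), and the toy data that stands in for both the real dataset and the labelled RCGAN output. It is not the authors' classifier or data pipeline.

```python
import random

def fit_centroids(X, y):
    """Nearest-centroid classifier: store the mean feature vector per label."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def train_test_score(train, test):
    """Fit on `train`, report accuracy on `test`; both are (X, y) pairs."""
    X_tr, y_tr = train
    X_te, y_te = test
    centroids = fit_centroids(X_tr, y_tr)
    hits = sum(predict(centroids, x) == label for x, label in zip(X_te, y_te))
    return hits / len(y_te)

def make_toy(n, rng, noise):
    """Hypothetical stand-in data: label-0 sequences sit near 0, label-1 near 1."""
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        X.append([label + rng.gauss(0.0, noise) for _ in range(8)])
        y.append(label)
    return X, y

rng = random.Random(0)
real_train = make_toy(200, rng, 0.2)
real_test = make_toy(100, rng, 0.2)
synthetic = make_toy(200, rng, 0.3)  # stands in for labelled generator samples

# TSTR: train on synthetic, test on real.
tstr = train_test_score(synthetic, real_test)
# TRTS: train on real, test on synthetic.
trts = train_test_score(real_train, synthetic)
print(f"TSTR accuracy: {tstr:.3f}  TRTS accuracy: {trts:.3f}")
```

A TSTR score close to the score of a model trained on real data suggests the synthetic set carries the label-relevant structure of the real one; TRTS checks the reverse direction, that synthetic samples look plausible to a model fitted on real data.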

Paper Structure

This paper contains 19 sections, 5 equations, 4 figures, 3 tables, 1 algorithm.

Figures (4)

  • Figure 1: Architecture of Recurrent GAN and Conditional Recurrent GAN models.
  • Figure 2: RGAN is capable of generating realistic-looking examples.
  • Figure 3: Traces of the generator loss (dotted), discriminator loss (solid), MMD$^2$ score, and log likelihood of generated samples under the data distribution during training of an RGAN generating smooth sequences (output in Figure \ref{fig:toy_sinerbf}).
  • Figure 4: Back-projecting training examples into the latent space and linearly interpolating between them produces smooth variation in the sample space. The top plot shows the sample-space distance from the top sample (green, dashed) to the bottom sample (orange, dotted). The distance measure is an RBF kernel with bandwidth set to the median pairwise distance between training samples. The original training examples are shown as dotted lines in the bottom and second-from-top plots.
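
The RBF kernel with a median-distance bandwidth mentioned in the Figure 4 caption, and the MMD$^2$ score traced in Figure 3, can be sketched as follows. This is a minimal illustration under stated assumptions: the kernel is parametrized as exp(-||a - b||^2 / (2 sigma^2)), MMD$^2$ is computed with the standard unbiased estimator, and the function names and Gaussian toy samples are hypothetical. The paper may select the bandwidth differently when computing MMD$^2$.

```python
import math
import random
from statistics import median

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def median_bandwidth(samples):
    """Median heuristic: median of all pairwise Euclidean distances."""
    dists = [math.sqrt(sq_dist(a, b))
             for i, a in enumerate(samples) for b in samples[i + 1:]]
    return median(dists)

def rbf(a, b, sigma):
    """RBF kernel, assumed parametrization exp(-||a-b||^2 / (2 sigma^2))."""
    return math.exp(-sq_dist(a, b) / (2.0 * sigma ** 2))

def mmd2_unbiased(X, Y, sigma):
    """Unbiased MMD^2 estimate between sample sets X and Y."""
    n, m = len(X), len(Y)
    k_xx = sum(rbf(X[i], X[j], sigma)
               for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    k_yy = sum(rbf(Y[i], Y[j], sigma)
               for i in range(m) for j in range(m) if i != j) / (m * (m - 1))
    k_xy = sum(rbf(x, y, sigma) for x in X for y in Y) / (n * m)
    return k_xx + k_yy - 2.0 * k_xy

def gaussian_samples(n, dim, shift, rng):
    """Hypothetical toy data: n vectors drawn from N(shift, 1) per coordinate."""
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]

rng = random.Random(0)
reference = gaussian_samples(40, 5, 0.0, rng)   # plays the role of training data
same_dist = gaussian_samples(40, 5, 0.0, rng)   # well-matched "generated" samples
shifted = gaussian_samples(40, 5, 2.0, rng)     # poorly matched "generated" samples

sigma = median_bandwidth(reference)
near = mmd2_unbiased(reference, same_dist, sigma)
far = mmd2_unbiased(reference, shifted, sigma)
print(f"sigma={sigma:.3f}  MMD^2 same={near:.4f}  MMD^2 shifted={far:.4f}")
```

Because the estimator is unbiased, the score for two samples from the same distribution fluctuates around zero and can be slightly negative, while a mismatched sample gives a clearly positive value.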