Synthetic Time Series Data Generation for Healthcare Applications: A PCG Case Study
Ainaz Jamshidi, Muhammad Arif, Sabir Ali Kalhoro, Alexander Gelbukh
TL;DR
This paper addresses the scarcity of high-quality synthetic phonocardiogram (PCG) data for healthcare. It benchmarks three generative approaches on PCG data from the PhysioNet Challenge 2022: WaveNet (autoregressive), DoppelGANger (GAN-based with dual generators), and DiffWave (diffusion-based). The evaluation framework combines AR metrics (MAE, MSE, SMAPE, ACD) with generative metrics (t-SNE, discriminative score, MMD, JSD). The study shows that all three models can produce healthy PCG signals whose distributions closely resemble real data (e.g., MMD ≈ 0.0001, JSD ≈ 0.014–0.015, and discriminative accuracy near 50%). The results support the feasibility of privacy-preserving data augmentation and provide a baseline for future work on synthesizing abnormal PCG signals for improved murmur detection in cardiology.
Abstract
The generation of high-quality medical time series data is essential for advancing healthcare diagnostics and safeguarding patient privacy. In particular, synthesizing realistic phonocardiogram (PCG) signals offers significant potential as a cost-effective and efficient tool for cardiac disease pre-screening. Despite this potential, the synthesis of PCG signals for this application has received limited attention in research. In this study, we employ and compare three state-of-the-art generative models from different categories (WaveNet, DoppelGANger, and DiffWave) to generate high-quality PCG data. We use data from the George B. Moody PhysioNet Challenge 2022. The models are evaluated using metrics widely used in the time series generation literature, such as mean absolute error and maximum mean discrepancy. Our results show that the generated PCG data closely resemble the original data, indicating that these generative models are effective at producing realistic synthetic PCG signals. In future work, we plan to incorporate this method into a data augmentation pipeline that synthesizes abnormal PCG signals with heart murmurs, in order to address the current scarcity of abnormal data. We hope this will improve the robustness and accuracy of diagnostic tools in cardiology and enhance their effectiveness in detecting heart murmurs.
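To make the named evaluation metrics concrete, the following is a minimal NumPy sketch of mean absolute error, SMAPE, maximum mean discrepancy, and Jensen-Shannon divergence. It is an illustration under stated assumptions, not the evaluation code used in this study: the Gaussian kernel and its bandwidth `sigma`, the biased MMD estimator, the histogram bin count, the base-2 logarithm, and all function names are choices made here for the example, and the random arrays merely stand in for windows of real and generated PCG signals.

```python
import numpy as np


def mae(real, synth):
    """Mean absolute error between two equally shaped arrays."""
    return float(np.mean(np.abs(real - synth)))


def smape(real, synth, eps=1e-12):
    """Symmetric MAPE as a fraction in [0, 2]; eps guards against 0/0."""
    denom = (np.abs(real) + np.abs(synth)) / 2.0 + eps
    return float(np.mean(np.abs(real - synth) / denom))


def rbf_mmd2(x, y, sigma=1.0):
    """Biased estimate of squared MMD with a Gaussian kernel.

    x has shape (n, d) and y has shape (m, d); each row is one window.
    sigma is an assumed bandwidth, not a value taken from the paper.
    """
    def kernel(a, b):
        sq = (np.sum(a ** 2, axis=1)[:, None]
              + np.sum(b ** 2, axis=1)[None, :]
              - 2.0 * a @ b.T)
        return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

    return float(kernel(x, x).mean() + kernel(y, y).mean()
                 - 2.0 * kernel(x, y).mean())


def jsd(real, synth, bins=50):
    """Jensen-Shannon divergence (base 2, in [0, 1]) of amplitude histograms."""
    lo = min(real.min(), synth.min())
    hi = max(real.max(), synth.max())
    p, _ = np.histogram(real, bins=bins, range=(lo, hi))
    q, _ = np.histogram(synth, bins=bins, range=(lo, hi))
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return float(0.5 * kl(p, m) + 0.5 * kl(q, m))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder data: 64 windows of 128 samples each, not real PCG signals.
    real = rng.normal(size=(64, 128))
    synth = rng.normal(size=(64, 128))
    print("MAE  :", mae(real, synth))
    print("SMAPE:", smape(real, synth))
    print("MMD^2:", rbf_mmd2(real, synth, sigma=10.0))
    print("JSD  :", jsd(real.ravel(), synth.ravel()))
```

Values of MMD and JSD close to zero indicate that the two sample sets are hard to tell apart under the chosen kernel or histogram. Both quantities depend on the bandwidth, binning, and logarithm base, so numbers from this sketch are not directly comparable to reported results unless those settings match.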
