Forecasting-based Biomedical Time-series Data Synthesis for Open Data and Robust AI
Youngjoon Lee, Seongmin Cho, Yehhyun Jo, Jinu Gong, Hyunjoo Jenny Lee, Joonhyuk Kang
TL;DR
The paper tackles data scarcity and privacy barriers in biomedical time-series AI by proposing forecasting-based, class-conditional data synthesis using time-series forecasters trained on sliding windows to generate high-fidelity EEG/EMG sequences. It evaluates the approach through a framework that transforms signals into STFT spectrograms and trains a ResNet-18 under original-only, synthetic-only, and combined data regimes, reporting accuracy and per-class F1 metrics. Empirical results across four subjects and sixteen forecasters show Transformer- and MLP-based forecasters achieve top performance, with synthetic data often approaching or surpassing real-data baselines and, when combined with real data, yielding robust improvements and reduced variance. Quality and privacy assessments indicate high statistical similarity with nonzero distance to real records, suggesting privacy-preserving synthetic data that can meaningfully augment open datasets. Overall, forecasting-based synthetic data demonstrates a practical path to open, privacy-conscious, robust biomedical AI, with strong potential for extension to broader modalities and clinical applications.
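The spectrogram step of the evaluation framework can be illustrated with a minimal sketch. It assumes a Hann window, a 64-sample frame, and a 16-sample hop; none of these settings comes from the paper, and the function name `stft_magnitude` and the 100 Hz test tone are hypothetical.

```python
import numpy as np

def stft_magnitude(signal, n_fft=64, hop=16):
    """Magnitude STFT of a 1-D signal (assumed settings, not the paper's).

    Returns an array of shape (n_fft // 2 + 1, n_frames).
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Cut the signal into overlapping frames and taper each one.
    frames = np.stack([
        signal[i * hop: i * hop + n_fft] * window
        for i in range(n_frames)
    ])
    # Real FFT per frame; transpose so rows are frequency bins.
    return np.abs(np.fft.rfft(frames, axis=1)).T

# Hypothetical input: a 12.5 Hz tone sampled at 100 Hz.
fs = 100.0
t = np.arange(512) / fs
tone = np.sin(2 * np.pi * 12.5 * t)
spec = stft_magnitude(tone)
```

An array like `spec` is the kind of 2-D, image-like input an image classifier such as ResNet-18 can consume. Any resizing or normalization applied before classification is not specified here.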
Abstract
Strict privacy regulations and significant resource demands limit data availability and severely constrain biomedical time-series AI development, creating a critical gap between data requirements and accessibility. Synthetic data generation presents a promising solution by producing artificial datasets that maintain the statistical properties of real biomedical time-series data without compromising patient confidentiality. While GANs, VAEs, and diffusion models capture global data distributions, forecasting models offer inductive biases tailored to sequential dynamics. We propose a framework for synthetic biomedical time-series data generation based on recent forecasting models that replicates complex electrophysiological signals such as EEG and EMG with high fidelity. These synthetic datasets can be freely shared for open AI development and consistently improve downstream model performance. Numerical results on sleep-stage classification show up to a 3.71% performance gain with augmentation and a 91.00% synthetic-only accuracy that surpasses the real-data-only baseline.
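The idea of forecasting-based, class-conditional synthesis can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: a least-squares linear map stands in for the Transformer- and MLP-based forecasters, class conditioning is approximated by fitting one forecaster per class, and the toy sinusoids, the class labels, and all function names are hypothetical.

```python
import numpy as np

def make_windows(series, context_len, horizon):
    """Slice a 1-D series into (context, target) pairs with stride 1."""
    X, Y = [], []
    for start in range(len(series) - context_len - horizon + 1):
        split = start + context_len
        X.append(series[start:split])
        Y.append(series[split:split + horizon])
    return np.asarray(X), np.asarray(Y)

def fit_linear_forecaster(records, context_len, horizon):
    """Fit a linear context-to-horizon map by least squares.

    Stand-in for a neural forecaster (an assumption of this sketch).
    """
    Xs, Ys = zip(*(make_windows(r, context_len, horizon) for r in records))
    X, Y = np.vstack(Xs), np.vstack(Ys)
    # rcond discards near-zero singular values so the fit stays stable.
    W, *_ = np.linalg.lstsq(X, Y, rcond=1e-8)
    return W  # shape (context_len, horizon)

def synthesize(W, seed, length):
    """Roll the forecaster forward autoregressively from a seed context."""
    context_len, _ = W.shape
    out = list(seed[-context_len:])
    while len(out) < context_len + length:
        context = np.asarray(out[-context_len:])
        out.extend(context @ W)  # append the next `horizon` samples
    return np.asarray(out[context_len:context_len + length])

# Hypothetical toy data: two classes with different dominant frequencies.
t = np.arange(400)
real = {
    "slow": [np.sin(2 * np.pi * 0.02 * t + p) for p in (0.0, 1.0)],
    "fast": [np.sin(2 * np.pi * 0.10 * t + p) for p in (0.0, 1.0)],
}

# One forecaster per class approximates class-conditional generation.
forecasters = {
    label: fit_linear_forecaster(records, context_len=32, horizon=8)
    for label, records in real.items()
}
synthetic = {
    label: synthesize(W, real[label][0][:32], length=200)
    for label, W in forecasters.items()
}
```

On these noiseless toy signals the rollout simply continues the seed record. The sketch therefore illustrates only the mechanics of windowing and rollout, and makes no claim about the fidelity or privacy properties reported for real EEG and EMG data.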
