Forecasting-based Biomedical Time-series Data Synthesis for Open Data and Robust AI

Youngjoon Lee, Seongmin Cho, Yehhyun Jo, Jinu Gong, Hyunjoo Jenny Lee, Joonhyuk Kang

TL;DR

The paper tackles data scarcity and privacy barriers in biomedical time-series AI by proposing forecasting-based, class-conditional data synthesis using time-series forecasters trained on sliding windows to generate high-fidelity EEG/EMG sequences. It evaluates the approach through a framework that transforms signals into STFT spectrograms and trains a ResNet-18 under original-only, synthetic-only, and combined data regimes, reporting accuracy and per-class F1 metrics. Empirical results across four subjects and sixteen forecasters show Transformer- and MLP-based forecasters achieve top performance, with synthetic data often approaching or surpassing real-data baselines and, when combined with real data, yielding robust improvements and reduced variance. Quality and privacy assessments indicate high statistical similarity with nonzero distance to real records, suggesting privacy-preserving synthetic data that can meaningfully augment open datasets. Overall, forecasting-based synthetic data demonstrates a practical path to open, privacy-conscious, robust biomedical AI, with strong potential for extension to broader modalities and clinical applications.
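The pipeline summarized above (sliding-window training, class-conditional generation, STFT spectrograms) can be illustrated with a small sketch. It is not the authors' implementation: a ridge-regularized linear map stands in for the Transformer- and MLP-based forecasters, the two sinusoids are toy stand-ins for per-class EEG, and every name and value here (`make_windows`, `fit_linear_forecaster`, `synthesize`, `stft_magnitude`, the context length of 32, the horizon of 8, the 200 Hz sampling rate) is an assumption chosen for illustration.

```python
import numpy as np

def make_windows(signal, context_len, horizon):
    """Slice a 1-D signal into (context, target) pairs with stride 1."""
    n = len(signal) - context_len - horizon + 1
    X = np.stack([signal[i:i + context_len] for i in range(n)])
    Y = np.stack([signal[i + context_len:i + context_len + horizon] for i in range(n)])
    return X, Y

def fit_linear_forecaster(X, Y, lam=1e-3):
    """Ridge-regularized linear map from context to horizon.
    Stand-in for a learned forecaster; NOT the models used in the paper."""
    Xb = np.hstack([X, np.ones((len(X), 1))])          # append a bias column
    A = Xb.T @ Xb + lam * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ Y)                # shape: (context_len + 1, horizon)

def synthesize(W, seed, total_len):
    """Autoregressive rollout: forecast a horizon, append it, slide the context."""
    context_len = W.shape[0] - 1
    out = list(seed[-context_len:])
    while len(out) < context_len + total_len:
        ctx = np.append(out[-context_len:], 1.0)       # last context plus bias term
        out.extend(ctx @ W)
    return np.array(out[context_len:context_len + total_len])

def stft_magnitude(signal, frame_len=64, hop=32):
    """Magnitude STFT with a Hann window, returned as (frequency, time)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T

# Toy "real" recordings, one per class (hypothetical 200 Hz sampling rate).
t = np.arange(2000) / 200.0
real_by_class = {"WAKE": np.sin(2 * np.pi * 10 * t),
                 "NREM": np.sin(2 * np.pi * 2 * t)}

# Class-conditional synthesis: one forecaster per class, seeded with real context.
synthetic = {}
for label, sig in real_by_class.items():
    X, Y = make_windows(sig, context_len=32, horizon=8)
    W = fit_linear_forecaster(X, Y)
    synthetic[label] = synthesize(W, seed=sig[:32], total_len=512)

# Spectrogram that a downstream image classifier could consume.
spec = stft_magnitude(synthetic["WAKE"])
```

Training one forecaster per class is one simple way to make generation class-conditional; the source does not say whether the authors use separate models or a conditioning input, so treat that choice as an assumption as well.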

Abstract

Limited data availability, driven by strict privacy regulations and significant resource demands, severely constrains biomedical time-series AI development and creates a critical gap between data requirements and accessibility. Synthetic data generation presents a promising solution by producing artificial datasets that maintain the statistical properties of real biomedical time-series data without compromising patient confidentiality. While GANs, VAEs, and diffusion models capture global data distributions, forecasting models offer inductive biases tailored for sequential dynamics. We propose a framework for synthetic biomedical time-series data generation based on recent forecasting models that replicates complex electrophysiological signals such as EEG and EMG with high fidelity. These synthetic datasets can be freely shared for open AI development and consistently improve downstream model performance. Numerical results on sleep-stage classification show up to a 3.71% performance gain with augmentation and a 91.00% synthetic-only accuracy that surpasses the real-data-only baseline.

Paper Structure

This paper contains 15 sections, 10 equations, 7 figures, 1 table, 1 algorithm.

Figures (7)

  • Figure 1: Open-source contribution and data gap fulfillment. (a) Open-source data contribution: Synthetic datasets are uploaded to public repositories (Hugging Face, Kaggle, AWS), enabling broader access for open AI development while maintaining privacy compliance. (b) Data gap fulfillment: Synthetic samples populate underrepresented regions in the feature space of real datasets, enhancing classifier training by providing a more comprehensive representation of the ideal data distribution.
  • Figure 2: Framework for synthetic biomedical time-series data generation. Synthetic data generation using time-series forecasters: The forecasting model trained on real biomedical signals generates similar patterns or enriches underrepresented segments.
  • Figure 3: The two-stage EEG data labeling procedure and corresponding power spectral density (PSD) analysis. (a) EEG raw data labeling process using EMG signals to differentiate between WAKE and SLEEP states, followed by EEG frequency domain filtering to distinguish NREM and REM sleep stages. (b) Normalized mean PSD analysis demonstrates distinct spectral signatures across four different subjects, validating the effectiveness and consistency of the labeling process.
  • Figure 4: Class-wise F1-scores across subjects for the synthetic-only (S) and combined original with synthetic data (O+S) conditions. For each model family, the best-performing forecaster was selected separately for S and O+S settings. All values represent mean ± std computed across 5 random seeds.
  • Figure 5: UMAP visualization of original and synthetic EEG data across subjects. The three panels display (a) Original data (O), (b) synthetic only (S), and (c) combined original with synthetic data (O+S) for each subject. Marker shapes denote sleep stages: WAKE (triangles), NREM (squares), and REM (circles).
  • ...and 2 more figures