Embedding-Space Data Augmentation to Prevent Membership Inference Attacks in Clinical Time Series Forecasting

Marius Fracarolli, Michael Staniek, Stefan Riezler

TL;DR

This work addresses privacy concerns in clinical time series forecasting (TSF) by using embedding-space data augmentation to mitigate Membership Inference Attacks (MIA) while preserving predictive accuracy. It compares zeroth-order optimization (ZOO) in embedding space, its PCA-restricted variant (ZOO-PCA), and MixUp, showing that ZOO-PCA yields the best privacy-utility tradeoff and that MixUp enhances generalization. The study demonstrates that augmenting training data with synthetic embeddings can substantially reduce the attacker’s advantage, as measured by the TPR/FPR ratio, without compromising test performance; DP-SGD can provide strong privacy but at a substantial utility cost. The findings suggest embedding-space augmentation as a practical defense for privacy-preserving TSF on public Electronic Health Record (EHR) datasets, with potential for hybrid approaches and broader applicability to other deep architectures and privacy attacks.

Abstract

Balancing strong privacy guarantees with high predictive performance is critical for time series forecasting (TSF) tasks involving Electronic Health Records (EHR). In this study, we explore how data augmentation can mitigate Membership Inference Attacks (MIA) on TSF models. We show that retraining with synthetic data can substantially reduce the effectiveness of loss-based MIAs by lowering the ratio of the attacker's true-positive rate to its false-positive rate (TPR/FPR). The key challenge is generating synthetic samples that closely resemble the original training data to confuse the attacker, while also introducing enough novelty to enhance the model's ability to generalize to unseen data. We examine multiple augmentation strategies, namely Zeroth-Order Optimization (ZOO), a variant of ZOO constrained by Principal Component Analysis (ZOO-PCA), and MixUp, to strengthen model resilience without sacrificing accuracy. Our experimental results show that ZOO-PCA yields the largest reduction in the TPR/FPR ratio of MIAs without sacrificing performance on test data.
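To make the idea of augmenting in embedding space concrete, the following is a minimal sketch of MixUp applied to embedded inputs. It is not the authors' implementation: the array layout `(n, T, d)`, the function name `mixup_embeddings`, and the choice to draw the mixing weight from a Beta(beta, beta) distribution (the common MixUp formulation) are all assumptions made for illustration.

```python
import numpy as np

def mixup_embeddings(emb, targets, beta=1.0, rng=None):
    """Return convex combinations of randomly paired training samples.

    emb:     (n, T, d) array of input embeddings (hypothetical layout)
    targets: (n, ...) array of forecasting targets
    beta:    shape parameter; lam ~ Beta(beta, beta) is an assumption here
    """
    rng = np.random.default_rng() if rng is None else rng
    n = emb.shape[0]
    lam = rng.beta(beta, beta, size=n)   # one mixing weight per sample
    perm = rng.permutation(n)            # random partner for each sample
    # Reshape lam so it broadcasts over the trailing axes of each array.
    lam_e = lam.reshape((n,) + (1,) * (emb.ndim - 1))
    lam_t = lam.reshape((n,) + (1,) * (targets.ndim - 1))
    mixed_emb = lam_e * emb + (1.0 - lam_e) * emb[perm]
    mixed_tgt = lam_t * targets + (1.0 - lam_t) * targets[perm]
    return mixed_emb, mixed_tgt, lam, perm
```

Each synthetic sample lies on the line segment between two real embeddings, so it stays close to the training data while not coinciding with any single record. Under the Beta(beta, beta) assumption, small values of `beta` concentrate the weights near 0 or 1 and keep the mixed samples close to one of the two originals.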

Paper Structure

This paper contains 24 sections, 9 equations, 5 figures, and 6 tables.

Figures (5)

  • Figure 1: The process of time series forecasting for irregularly sampled medical time series data. Real-world data is binned into hourly buckets. The binned data is then transformed by a Dense Encoder Embedding Function into the embeddings used in the augmentation process. The embeddings are processed by a Transformer encoder, which learns contextual representations. An iterative multistep forecasting (IMS) decoder with autoregressive properties generates forecasts.
  • Figure 2: The bar plots (right axis) show the average number of different clinical variables recorded per patient per hour after admission (multiple measurements of the same variable within one hour are deleted during binning). These counts remain relatively stable over time for both MIMIC (orange) and eICU (blue). In contrast, the line plots (left axis) display the number of patients contributing to each 4-hour sliding window, which declines over time, more sharply in eICU, reflecting the decreasing number of long-staying patients.
  • Figure 3: MIMIC-III: ROC curves (log-log scaling) for varying thresholds of loss-based MIAs on models trained with and without data augmentation. The upper plot magnifies the area for FPR $<$ 0.1%. The DP-SGD curve (not shown) is nearly indistinguishable from the diagonal, which represents random guessing.
  • Figure 4: eICU: ROC curves (log-log scaling) for varying thresholds of loss-based MIAs on models trained with and without data augmentation. The upper plot magnifies the area for FPR $<$ 0.1%. The DP-SGD curve (not shown) is nearly indistinguishable from the diagonal, which represents random guessing.
  • Figure 5: TPR/FPR ratio of MIAs plotted against generalization performance on test data for MIMIC-III (top) and eICU (bottom). The size of each ball indicates the interpolation parameter used in data augmentation: $\alpha \in \{0, \frac{1}{4}, \frac{1}{2}, \frac{3}{4}, 1\}$ for ZOO and ZOO-PCA, and $\beta \in \{0.2, 1, 5\}$ for MixUp.
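The hourly binning described in the captions of Figures 1 and 2 can be sketched as follows. This is only an illustration under stated assumptions: the event tuple format, the function names `bin_hourly` and `variables_per_hour`, and the rule for which of several same-hour measurements survives (the last one by default) are hypothetical, since the captions say only that duplicates within an hour are deleted.

```python
import math

def bin_hourly(events, keep="last"):
    """Bin irregularly sampled measurements into hourly buckets.

    events: iterable of (hours_since_admission, variable, value) tuples
    keep:   which measurement survives when a variable is recorded several
            times within one hour ("last" or "first"); this rule is an
            assumption, not taken from the paper
    """
    binned = {}
    for t, var, value in sorted(events, key=lambda e: e[0]):
        key = (math.floor(t), var)
        if keep == "last" or key not in binned:
            binned[key] = value
    return binned

def variables_per_hour(binned):
    """Count the distinct variables present in each hourly bucket."""
    counts = {}
    for hour, _var in binned:
        counts[hour] = counts.get(hour, 0) + 1
    return counts

# Two heart-rate readings fall into hour 0; only one is kept.
events = [(0.2, "HR", 80), (0.7, "HR", 84), (0.9, "SpO2", 97), (1.5, "HR", 90)]
binned = bin_hourly(events)
print(binned)
print(variables_per_hour(binned))
```

Averaging the per-hour counts over patients would give a quantity comparable to the bars in Figure 2, although the exact preprocessing behind that figure is not specified here.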

Theorems & Definitions (1)

  • Definition 1: Loss-based MIA
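The text of Definition 1 is not reproduced in this overview. As a rough guide, a loss-based MIA in its common threshold form predicts that a sample was part of the training set when the model's loss on it falls below a threshold. The sketch below assumes that form; the function names and the handling of a zero false-positive rate are illustrative choices, not the paper's definition.

```python
import numpy as np

def loss_threshold_attack(member_losses, nonmember_losses, threshold):
    """Evaluate the rule 'predict member if loss <= threshold'.

    Returns (TPR, FPR, TPR/FPR). The ratio is inf when FPR is 0 and TPR > 0,
    and nan when both are 0 (an illustrative convention).
    """
    member_losses = np.asarray(member_losses, dtype=float)
    nonmember_losses = np.asarray(nonmember_losses, dtype=float)
    tpr = float(np.mean(member_losses <= threshold))     # members flagged
    fpr = float(np.mean(nonmember_losses <= threshold))  # non-members flagged
    if fpr > 0:
        ratio = tpr / fpr
    elif tpr > 0:
        ratio = float("inf")
    else:
        ratio = float("nan")
    return tpr, fpr, ratio

def roc_points(member_losses, nonmember_losses):
    """Sweep the threshold over all observed losses to trace an ROC curve."""
    thresholds = np.unique(np.concatenate([member_losses, nonmember_losses]))
    return [
        (t,) + loss_threshold_attack(member_losses, nonmember_losses, t)[:2]
        for t in thresholds
    ]

members = [0.1, 0.2, 0.3, 0.9]       # losses on training samples
nonmembers = [0.25, 0.8, 1.0, 1.2]   # losses on held-out samples
print(loss_threshold_attack(members, nonmembers, threshold=0.3))
# -> (0.75, 0.25, 3.0)
```

A ratio near 1 means the attacker does no better than random guessing at that threshold, which is why a lower TPR/FPR ratio at small FPR indicates a more effective defense.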