FusionDP: Foundation Model-Assisted Differentially Private Learning for Partially Sensitive Features

Linghui Zeng, Ruixuan Liu, Atiquer Rahman Sarkar, Xiaoqian Jiang, Joyce C. Ho, Li Xiong

TL;DR

FusionDP addresses privacy when only a subset of features requires protection by imputing sensitive attributes with foundation-model priors and training under a feature-DP framework. It introduces a two-branch training objective that combines public gradients on hybrid data with DP-SGD-protected private gradients, plus a representation-consistency regularizer to align original and imputed representations. The method is rigorously analyzed under $f$-DP$^{\Psi}_r$ with a concrete noise schedule and privacy amplification via sub-sampling, and demonstrates improved utility over DP-SGD and existing feature-DP baselines on both tabular sepsis data and MIMIC-III clinical notes. These results highlight the practical value of integrating foundation-model priors for selectively private data, enabling stronger privacy guarantees without sacrificing predictive performance across modalities.

Abstract

Ensuring the privacy of sensitive training data is crucial in privacy-preserving machine learning. However, in practical scenarios, privacy protection may be required for only a subset of features. For instance, in ICU data, demographic attributes like age and gender pose higher privacy risks due to their re-identification potential, whereas raw lab results are generally less sensitive. Traditional DP-SGD enforces privacy protection on all features in one sample, leading to excessive noise injection and significant utility degradation. We propose FusionDP, a two-step framework that enhances model utility under feature-level differential privacy. First, FusionDP leverages large foundation models to impute sensitive features given non-sensitive features, treating them as external priors that provide high-quality estimates of sensitive attributes without accessing the true values during model training. Second, we introduce a modified DP-SGD algorithm that trains models on both original and imputed features while formally preserving the privacy of the original sensitive features. We evaluate FusionDP on two modalities: a sepsis prediction task on tabular data from PhysioNet and a clinical note classification task from MIMIC-III. By comparing against privacy-preserving baselines, our results show that FusionDP significantly improves model performance while maintaining rigorous feature-level privacy, demonstrating the potential of foundation model-driven imputation to enhance the privacy-utility trade-off for various modalities.
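The second step described in the abstract can be pictured with a small sketch. This is not the authors' algorithm: it is a minimal illustration that assumes a logistic-regression model, a noise-free "public" gradient computed on hybrid (imputed) inputs, and a DP-SGD-style "private" gradient computed on the original inputs with Poisson sampling, per-example clipping to norm `C`, and Gaussian noise of standard deviation `sigma * C`. The function names, the mixing weight `lam`, and the normalization by the expected batch size are all hypothetical choices made for illustration, and the sketch makes no claim about the resulting privacy guarantee.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_grad(w, x, y):
    """Gradient of the logistic loss for one example (x, y), with y in {0, 1}."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [(p - y) * xi for xi in x]


def clip(g, C):
    """Rescale g so that its L2 norm is at most C."""
    norm = math.sqrt(sum(v * v for v in g))
    scale = min(1.0, C / norm) if norm > 0 else 1.0
    return [v * scale for v in g]


def two_branch_step(w, original, hybrid, labels, *, lr, C, sigma, p, lam, rng):
    """One illustrative update mixing a public and a noised private gradient."""
    d, n = len(w), len(labels)

    # Public branch (assumed): plain average gradient on hybrid inputs,
    # whose sensitive features were imputed rather than observed.
    g_pub = [0.0] * d
    for x_h, y in zip(hybrid, labels):
        for j, v in enumerate(logistic_grad(w, x_h, y)):
            g_pub[j] += v / n

    # Private branch (assumed): Poisson-sample each original example with
    # rate p, clip its gradient to norm C, sum, then add Gaussian noise.
    g_priv = [0.0] * d
    for x_o, y in zip(original, labels):
        if rng.random() < p:
            for j, v in enumerate(clip(logistic_grad(w, x_o, y), C)):
                g_priv[j] += v
    expected_batch = max(p * n, 1.0)
    g_priv = [(v + rng.gauss(0.0, sigma * C)) / expected_batch for v in g_priv]

    # Combine both branches; lam is a hypothetical mixing weight.
    return [wi - lr * (gp + lam * gv) for wi, gp, gv in zip(w, g_pub, g_priv)]
```

Calling `two_branch_step` repeatedly with a seeded `random.Random` gives a reproducible toy training loop. Setting `sigma=0.0` and `p=1.0` removes the noise and sub-sampling, which is convenient for checking the arithmetic but provides no privacy protection.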

Paper Structure

This paper contains 33 sections, 1 theorem, 7 equations, 5 figures, 5 tables, 1 algorithm.

Key Result

Theorem 1

Let the per-example gradient of the private loss be clipped to norm $C$, and let $p$ be the Poisson sampling rate for the private batch. Suppose the private loss is computed using the original input $x = (x_{\text{pub}}, x_{\text{priv}})$ and the imputed hybrid input $\tilde{x} = (x_{\text{pub}}, \ldots)$ … where $\tau$ is the (post-clipping) Lipschitz bound of the private loss, and $\sigma$ is the Gaussi…

Figures (5)

  • Figure 1: Overview of FusionDP. Sensitive (purple) features in the original data are masked and then imputed (space gray) using foundation models. The hybrid data is used for public gradient updates. Gradients involving private data (red arrows) are protected by clipping and added noise.
  • Figure 2: Best AUPRC (average of 3 runs) on the Sepsis prediction task across privacy budgets ($\epsilon$).
  • Figure 3: Snippet illustrating the flow from raw (original) clinical text to masked (public) text to imputed (hybrid) text.
  • Figure 4: Prompt for imputing redacted notes with GPT-4o-mini
  • Figure 5: Experiment on Sepsis prediction with only age, gender, ICU unit type as sensitive features. FusionDP consistently outperforms other private baselines.

Theorems & Definitions (2)

  • Definition 1: Feature Differential Privacy
  • Theorem 1: FusionDP Privacy, $(\varepsilon,\delta)$-DP$^{\psi}_i$