Privacy Preserving Diffusion Models for Mixed-Type Tabular Data Generation

Timur Sattarov, Marco Schreyer, Damian Borth

TL;DR

The paper tackles privacy-aware synthesis of mixed-type tabular data by marrying diffusion models with embedding-based categorical representations. It introduces DP-FinDiff, a DP-enabled diffusion framework augmented with Adaptive Timestep sampling and Feature-Aggregated loss to mitigate DP noise and gradient clipping effects. Empirical results on finance and healthcare datasets show DP-FinDiff delivering 16-42% higher utility than DP baselines, with further gains from the proposed training enhancements and embedding-based encodings. The work offers a scalable, privacy-preserving approach for sharing sensitive tabular data in high-stakes domains, while acknowledging limitations in fairness evaluation and potential biases in synthetic outputs.

Abstract

We introduce DP-FinDiff, a differentially private diffusion framework for synthesizing mixed-type tabular data. DP-FinDiff employs embedding-based representations for categorical features, reducing encoding overhead and scaling to high-dimensional datasets. To adapt DP-training to the diffusion process, we propose two privacy-aware training strategies: an adaptive timestep sampler that aligns updates with diffusion dynamics, and a feature-aggregated loss that mitigates clipping-induced bias. Together, these enhancements improve fidelity and downstream utility without weakening privacy guarantees. On financial and medical datasets, DP-FinDiff achieves 16-42% higher utility than DP baselines at comparable privacy levels, demonstrating its promise for safe and effective data sharing in sensitive domains.
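
The DP training that the abstract refers to relies on per-sample gradient clipping followed by Gaussian noise. A minimal NumPy sketch of a generic DP-SGD update may help make the clipping effect concrete. It is an illustration only, not the authors' implementation: the function name `dp_sgd_step` and the parameters `clip_norm` and `noise_multiplier` are hypothetical, and privacy accounting for $(\varepsilon,\delta)$ is left out.

```python
import numpy as np

def dp_sgd_step(per_sample_grads, clip_norm, noise_multiplier, rng):
    """Generic DP-SGD gradient estimate (illustrative, hypothetical helper).

    per_sample_grads: array of shape (batch, dim), one gradient per record.
    """
    # L2 norm of each per-sample gradient.
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    # Scale down only those gradients whose norm exceeds clip_norm.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_sample_grads * scale
    # Gaussian noise calibrated to the clipping threshold.
    noise = rng.normal(0.0, noise_multiplier * clip_norm,
                       size=per_sample_grads.shape[1])
    # Noisy sum, averaged over the batch.
    return (clipped.sum(axis=0) + noise) / per_sample_grads.shape[0]
```

In this sketch, gradients with large norms are shrunk while small ones pass through unchanged, which is one way to see how clipping can bias the update when per-sample norms are skewed.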

Paper Structure

This paper contains 30 sections, 2 theorems, 20 equations, 8 figures, 4 tables.

Key Result

Proposition 1

Assume that for epoch $k$ the DP-effective signal $s_k(t)$ is monotone in $t$ (decreasing after an initial stabilization period). Then there exists $\alpha_k\in\mathbb{R}$ such that $t^{\alpha_k}$ fits $s_k(t)$ in the least-squares sense on $\{\log t\}$, i.e., $\alpha_k=\arg\min_\alpha \sum_t \bigl(

Figures (8)

  • Figure 1: Schematic of the diffusion process for mixed-type tabular data with DP. Left: forward noise addition from $X_0$ to $X_T$. Middle: DP denoising learning with clipped, noised gradients under budget $(\varepsilon,\delta)$. Right: reverse reconstruction from $X_T$ to $X_0$, producing $(\varepsilon,\delta)$-DP compliant synthetic data.
  • Figure 2: Training time per epoch as the dataset grows in both rows and columns.
  • Figure 3: Timestep dynamics in DP-FinDiff. (Left) Gradient norms with DP clipping threshold (red dashed). (Middle) AT sampler heatmap showing the shift from later to earlier diffusion timesteps over training. (Right) Sampling distributions at early, mid, and late phases ($\alpha_{\text{start}}=3$, $\alpha_{\text{end}}=-1$).
  • Figure 4: Per-sample gradient norms on Adult: MSE vs. FA loss. Top: normalized variance over epochs. Bottom: distributions of gradient norms at epochs 100/500/900, showing FA reduces skewness and variance.
  • Figure 5: Utility and fidelity of DP-FinDiff variants with FA, AT, and FA+AT. Enhancements consistently boost results, most notably at $\varepsilon\!=\!0.2$.
  • ...and 3 more figures
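
The shift described in the Figure 3 caption, from later to earlier diffusion timesteps, can be sketched as a power-law sampler with $p(t) \propto t^{\alpha}$. The endpoint values $\alpha_{\text{start}}=3$ and $\alpha_{\text{end}}=-1$ come from the caption. Everything else here is an assumption made for illustration: the linear interpolation of $\alpha$ over epochs, the 1-based timestep indexing, and the helper names `timestep_probs`, `alpha_at`, and `sample_timesteps`.

```python
import numpy as np

def timestep_probs(num_timesteps, alpha):
    """Normalized power-law weights p(t) proportional to t**alpha, t = 1..T."""
    t = np.arange(1, num_timesteps + 1, dtype=np.float64)
    w = t ** alpha
    return w / w.sum()

def alpha_at(epoch, num_epochs, alpha_start=3.0, alpha_end=-1.0):
    """Assumed schedule: linear interpolation from alpha_start to alpha_end."""
    frac = epoch / max(num_epochs - 1, 1)
    return alpha_start + frac * (alpha_end - alpha_start)

def sample_timesteps(batch_size, num_timesteps, epoch, num_epochs, rng):
    """Draw a batch of timesteps from the epoch-dependent power law."""
    p = timestep_probs(num_timesteps, alpha_at(epoch, num_epochs))
    return rng.choice(np.arange(1, num_timesteps + 1), size=batch_size, p=p)
```

With a positive exponent most of the probability mass sits on large $t$, and with a negative exponent it moves toward small $t$, which matches the qualitative early-to-late training shift in the caption.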

Theorems & Definitions (2)

  • Proposition 1: Power-law proxy tracks $q^\star_{\mathrm{DP}}$
  • Proposition 2