Privacy Preserving Diffusion Models for Mixed-Type Tabular Data Generation
Timur Sattarov, Marco Schreyer, Damian Borth
TL;DR
The paper tackles privacy-aware synthesis of mixed-type tabular data by combining diffusion models with embedding-based categorical representations. It introduces DP-FinDiff, a differentially private (DP) diffusion framework augmented with an adaptive timestep sampler and a feature-aggregated loss to mitigate the effects of DP noise and gradient clipping. Empirical results on finance and healthcare datasets show DP-FinDiff delivering 16-42% higher utility than DP baselines at comparable privacy levels, with further gains from the proposed training enhancements and embedding-based encodings. The work offers a scalable, privacy-preserving approach to sharing sensitive tabular data in high-stakes domains, while acknowledging limitations in fairness evaluation and potential biases in synthetic outputs.
Abstract
We introduce DP-FinDiff, a differentially private (DP) diffusion framework for synthesizing mixed-type tabular data. DP-FinDiff employs embedding-based representations for categorical features, reducing encoding overhead and scaling to high-dimensional datasets. To adapt DP training to the diffusion process, we propose two privacy-aware training strategies: an adaptive timestep sampler that aligns updates with diffusion dynamics, and a feature-aggregated loss that mitigates clipping-induced bias. Together, these enhancements improve fidelity and downstream utility without weakening privacy guarantees. On financial and medical datasets, DP-FinDiff achieves 16-42% higher utility than DP baselines at comparable privacy levels, demonstrating its promise for safe and effective data sharing in sensitive domains.
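To make the notions of gradient clipping and DP noise concrete, the following is a minimal sketch of a generic DP-SGD-style gradient aggregation step, written in plain Python. It is not the DP-FinDiff implementation: the abstract does not specify the optimizer or its parameters, so the function names (`clip_gradient`, `dp_average_gradient`) and the parameters `max_norm` and `noise_multiplier` are illustrative assumptions. The sketch only shows why clipping can bias training: examples with large gradients are scaled down, while those with small gradients pass through unchanged.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale one per-example gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_average_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each per-example gradient, sum, add Gaussian noise, and average.

    Hypothetical helper: the noise standard deviation is taken to be
    noise_multiplier * max_norm, as in the standard DP-SGD recipe.
    """
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        clipped = clip_gradient(grad, max_norm)
        total = [t + c for t, c in zip(total, clipped)]
    sigma = noise_multiplier * max_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [n / len(per_example_grads) for n in noisy]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    # Toy per-example gradients: the first has norm 5, the second norm 0.5.
    grads = [[3.0, 4.0], [0.3, 0.4]]
    print("clipped large gradient:", clip_gradient(grads[0], 1.0))
    print("noisy averaged gradient:", dp_average_gradient(grads, 1.0, 1.0, rng))
```

In this toy case the first gradient is shrunk to unit norm while the second is left untouched, so the averaged update no longer reflects the original relative magnitudes. This is the kind of clipping-induced distortion that the feature-aggregated loss is described as mitigating.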
