
FedTabDiff: Federated Learning of Diffusion Probabilistic Models for Synthetic Mixed-Type Tabular Data Generation

Timur Sattarov, Marco Schreyer, Damian Borth

TL;DR

This paper tackles privacy concerns in synthetic tabular data generation by proposing FedTabDiff, a federated diffusion-based approach that trains DDPMs across distributed clients without exchanging raw data. It integrates a central diffusion model with client-specific models using synchronous FedAvg-style updates to enable collaborative learning while preserving data locality. Experiments on real-world datasets from finance and healthcare demonstrate superior fidelity, utility, privacy, and coverage compared to non-federated baselines, even under non-iid data partitions. The framework offers a practical pathway for privacy-preserving data sharing and analytics in sensitive domains.

Abstract

Realistic synthetic tabular data generation encounters significant challenges in preserving privacy, especially when dealing with sensitive information in domains like finance and healthcare. In this paper, we introduce Federated Tabular Diffusion (FedTabDiff) for generating high-fidelity mixed-type tabular data without centralized access to the original tabular datasets. Leveraging the strengths of Denoising Diffusion Probabilistic Models (DDPMs), our approach addresses the inherent complexities in tabular data, such as mixed attribute types and implicit relationships. More critically, FedTabDiff realizes a decentralized learning scheme that permits multiple entities to collaboratively train a generative model while respecting data privacy and locality. We extend DDPMs into the federated setting for tabular data generation, which includes a synchronous update scheme and weighted averaging for effective model aggregation. Experimental evaluations on real-world financial and medical datasets attest to the framework's capability to produce synthetic data that maintains high fidelity, utility, privacy, and coverage.
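
The "weighted averaging" mentioned in the abstract can be written out as a single formula. The version below is a sketch under an assumption: it takes the weights to be proportional to each client's local dataset size, as in standard FedAvg, which this summary does not state explicitly. The symbols follow the Figure 1 and Figure 2 captions: $\theta_{i}^{\omega}$ are the parameters of client $\omega_i$, $\theta^{\xi}$ is the consolidated server model, $\mathcal{D}_i$ is the data subset of client $i$, and $\lambda$ is the number of clients.

$$
\theta^{\xi} \;=\; \sum_{i=1}^{\lambda} \frac{|\mathcal{D}_i|}{\sum_{j=1}^{\lambda} |\mathcal{D}_j|}\; \theta_{i}^{\omega}
$$

Under this assumption, a client holding three times as many records as another contributes three times the weight to the consolidated model.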

Paper Structure

This paper contains 9 sections, 9 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Schematic representation of the proposed FedTabDiff model. It illustrates how each client $\omega_{i}$ independently trains a local diffusion model, named FinDiff (Sattarov et al., 2023). The process is depicted through various timesteps $X_T, \ldots, X_1, X_0$, each representing different stages of latent data representations in the reverse diffusion process. The individual model parameters, denoted as $\theta_{i}^{\omega}$, are periodically aggregated on a central server to form the consolidated model $\theta^{\xi}$. After every training round, the server redistributes this consolidated model to each client.
  • Figure 2: Evaluation of Federated (FedTabDiff) vs. Non-Federated (FinDiff) Diffusion Models in terms of fidelity, privacy, and coverage across individual clients ($\omega_i$). For the Federated model, the aggregated central model is evaluated across all client data subsets ($\mathcal{D}_1$, $\mathcal{D}_2$, ..., $\mathcal{D}_\lambda$). For the Non-Federated model, each client's model is trained on its respective data subset ($\mathcal{D}_i$) and evaluated across all subsets ($\mathcal{D}_1$, $\mathcal{D}_2$, ..., $\mathcal{D}_\lambda$).
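
The training round described in the Figure 1 caption (broadcast the consolidated model, train locally, aggregate on the server) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The names `Params`, `weighted_average`, `federated_round`, and `local_update` are hypothetical, parameters are stored as plain lists of floats rather than network tensors, and the weighting by local sample count is an assumption borrowed from standard FedAvg. The `local_update` callback stands in for a client's local diffusion-model training steps.

```python
from typing import Callable, Dict, List

# Hypothetical container: parameter name -> flat list of values.
Params = Dict[str, List[float]]


def weighted_average(client_params: List[Params], client_sizes: List[int]) -> Params:
    """Average client parameters, weighting each client by its sample count.

    Assumption: weights are proportional to local dataset size (FedAvg-style).
    """
    if not client_params or len(client_params) != len(client_sizes):
        raise ValueError("need at least one client and one size per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total number of samples must be positive")

    aggregated: Params = {}
    for name, reference in client_params[0].items():
        aggregated[name] = [
            sum(p[name][k] * n for p, n in zip(client_params, client_sizes)) / total
            for k in range(len(reference))
        ]
    return aggregated


def federated_round(
    global_params: Params,
    client_sizes: List[int],
    local_update: Callable[[int, Params], Params],
) -> Params:
    """Run one synchronous round: broadcast, local training, aggregation."""
    local_results = []
    for client_id in range(len(client_sizes)):
        # Each client starts from its own copy of the consolidated model.
        start = {name: list(values) for name, values in global_params.items()}
        # Placeholder for the client's local training on its private data.
        local_results.append(local_update(client_id, start))
    # The server waits for all clients, then forms the new consolidated model.
    return weighted_average(local_results, client_sizes)


if __name__ == "__main__":
    # Toy stand-in for local training: shift every parameter by (client_id + 1).
    def toy_update(client_id: int, params: Params) -> Params:
        return {k: [v + client_id + 1 for v in vals] for k, vals in params.items()}

    print(federated_round({"w": [0.0]}, [1, 3], toy_update))
```

Only parameters cross the client boundary in this sketch; the private records each client trains on never leave `local_update`, which mirrors the data-locality property the paper emphasizes.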