FLAIM: AIM-based Synthetic Data Generation in the Federated Setting

Samuel Maddock; Graham Cormode; Carsten Maple

TL;DR

This work addresses privacy-preserving synthetic data generation for tabular data in federated settings. It builds on AIM by developing DistAIM (a distributed approach) and FLAIM (an FL-style approach), with AugFLAIM variants that mitigate the effects of heterogeneity while keeping communication overhead low. Empirical results on diverse datasets show that AugFLAIM (Private) often matches DistAIM in utility while significantly reducing overhead, and that it outperforms federated deep-learning baselines such as DP-CTGAN. The methods advance practical differentially private (DP) federated synthetic data generation, remaining robust when client data is non-IID and client participation is limited, which supports safer cross-institution data sharing for downstream analytics.

Abstract

Preserving individual privacy while enabling collaborative data sharing is crucial for organizations. Synthetic data generation is one solution, producing artificial data that mirrors the statistical properties of private data. While numerous techniques have been devised under differential privacy, they predominantly assume that data is centralized. However, data is often distributed across multiple clients in a federated manner. In this work, we initiate the study of federated synthetic tabular data generation. Building upon a state-of-the-art central method known as AIM, we present DistAIM and FLAIM. We first show that it is straightforward to distribute AIM, extending a recent approach based on secure multi-party computation; this approach incurs additional overhead, making it less suited to federated scenarios. We then demonstrate that naively federating AIM can lead to substantial degradation in utility in the presence of heterogeneity. To mitigate both issues, we propose an augmented FLAIM approach that maintains a private proxy of heterogeneity. We simulate our methods across a range of benchmark datasets under different degrees of heterogeneity and show that we can improve utility while reducing overhead.

Paper Structure (36 sections, 3 theorems, 9 equations, 11 figures, 7 tables, 3 algorithms)

Introduction
Preliminaries
Differential Privacy (DP)
Iterative Methods (Select-Measure-Generate).
Towards Decentralized Synthetic Data.
Distributed AIM
FLAIM: FL analog for AIM
NaiveFLAIM and Heterogeneous Data
AugFLAIM (Oracle): Tackling Heterogeneity
AugFLAIM (Private): Heterogeneity Proxy
Experimental Evaluation
Comparison with Existing Baselines
Ablation Study: Utility of AugFLAIM
Parameter Settings
Varying the privacy budget $({\epsilon})$.
...and 21 more sections

Key Result

Lemma 4.1

For any number of global rounds $T$ and local rounds $s$, FLAIM satisfies $({\epsilon},\delta)$-DP under Gaussian budget allocation $r \in (0,1)$ by computing $\rho$ according to Lemma A.2 and setting the per-round mechanism parameters accordingly (the displayed parameter settings are missing from this excerpt). For AugFLAIM methods, the exponential mechanism is applied with sensitivity $\Delta := \max_q 2w_q$.
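The accounting behind a statement like Lemma 4.1 can be illustrated with a minimal Python sketch. It rests on several assumptions not confirmed by this excerpt: it uses the classical bound that $\rho$-zCDP implies $(\rho + 2\sqrt{\rho \ln(1/\delta)}, \delta)$-DP rather than the (possibly tighter) conversion of Lemma A.2, it reads the allocation $r$ as the fraction of $\rho$ spent on Gaussian measurements, and all function names and the example weights are hypothetical.

```python
import math

def zcdp_to_eps(rho: float, delta: float) -> float:
    """Classical bound: rho-zCDP implies (rho + 2*sqrt(rho*ln(1/delta)), delta)-DP.
    Assumption: stands in for the paper's Lemma A.2, which may be tighter."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))

def eps_to_rho(eps: float, delta: float, tol: float = 1e-12) -> float:
    """Largest rho whose converted epsilon stays within eps, found by bisection.
    zcdp_to_eps is increasing in rho and zcdp_to_eps(eps, delta) >= eps,
    so [0, eps] brackets the answer."""
    lo, hi = 0.0, eps
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if zcdp_to_eps(mid, delta) <= eps:
            lo = mid
        else:
            hi = mid
    return lo

def split_budget(rho: float, r: float) -> tuple[float, float]:
    """Assumed reading of r: a fraction r of rho goes to Gaussian measurements,
    the remaining (1 - r) to exponential-mechanism selection."""
    return r * rho, (1.0 - r) * rho

def selection_sensitivity(weights: dict[str, float]) -> float:
    """Sensitivity Delta := max_q 2 * w_q over the workload weights w_q."""
    return max(2.0 * w for w in weights.values())

# Hypothetical example values.
rho = eps_to_rho(eps=1.0, delta=1e-5)
rho_measure, rho_select = split_budget(rho, r=0.9)
delta_sens = selection_sensitivity({"age": 1.0, "age,income": 0.5})
```

The bisection keeps the sketch independent of any closed-form inverse, so a different zCDP-to-DP conversion can be swapped in by replacing `zcdp_to_eps` alone, provided it remains increasing in $\rho$.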

Figures (11)

Figure 1: Average error over a workload of marginals for (FL)AIM trained with ${\epsilon}=1$ on a toy federated dataset. $\beta$ varies client feature skew where large $\beta$ results in less skew.
Figure 2: Ablation study, comparing utility for FLAIM variations that augment local utility scores, ${\epsilon}=5, T=10, s=1, p=0.1$
Figure 3: Varying (FL)AIM parameters on Adult; unless otherwise stated, $T=10, s=1, p=0.1, K=100$, ${\epsilon}=1$
Figure 4: SynthFS: Synthetic dataset constructed with feature skew, varying $\beta \in \{1,2,3,5\}$
Figure 5: Clustering approach to form non-IID splits on the Adult dataset with $K=100$ clients. All panels show the same embedding formed from UMAP: one panel shows each client's local dataset, formed by clustering in the embedding space, and three further panels show the same embedding colored by a single feature each (age, hours worked per week, and income > 50k). The embedding is used only to map examples to clients; AIM models are trained on the raw data.
...and 6 more figures

Theorems & Definitions (11)

Example
Definition 2.1: Marginal Query
Definition 2.2: Average Workload Error
Definition 2.3: $\rho$-zCDP
Definition 2.4: Sensitivity
Definition 2.5: Gaussian Mechanism
Definition 2.6: Exponential Mechanism
Lemma 4.1
Definition A.1: Differential Privacy (Dwork and Roth, 2014)
Lemma A.2: zCDP to DP (Canonne et al., 2020)
...and 1 more
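
Definitions 2.1 and 2.2 above are listed by name only, so the following Python sketch is an illustration under stated assumptions rather than the paper's exact formulation: it treats a marginal query as the normalized frequency table over a subset of attributes, and the average workload error as the mean L1 distance between real and synthetic marginals across the workload. The function names and the toy records are hypothetical.

```python
from collections import Counter

def marginal(rows, q):
    """Normalized marginal over attribute subset q:
    maps each observed value combination to its relative frequency."""
    counts = Counter(tuple(row[a] for a in q) for row in rows)
    n = len(rows)
    return {key: c / n for key, c in counts.items()}

def l1_distance(m1, m2):
    """L1 distance between two sparse frequency tables."""
    keys = set(m1) | set(m2)
    return sum(abs(m1.get(k, 0.0) - m2.get(k, 0.0)) for k in keys)

def average_workload_error(real, synth, workload):
    """Mean L1 distance between real and synthetic marginals over the workload.
    Assumption: normalization by dataset size; the paper may scale differently."""
    total = sum(l1_distance(marginal(real, q), marginal(synth, q)) for q in workload)
    return total / len(workload)

# Hypothetical toy data with two binary attributes.
real = [{"a": 0, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 1}, {"a": 1, "b": 1}]
synth = [{"a": 0, "b": 0}, {"a": 0, "b": 0}, {"a": 1, "b": 1}, {"a": 1, "b": 1}]
workload = [("a",), ("a", "b")]

err = average_workload_error(real, synth, workload)  # 0.25 for this toy data
```

In this toy case the one-way marginal on `a` matches exactly, while the two-way marginal on `(a, b)` differs by 0.5 in L1 distance, so the average over the two queries is 0.25.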
