FHAIM: Fully Homomorphic AIM For Private Synthetic Data Generation

Mayank Kumar, Qian Lou, Paulo Barreto, Martine De Cock, Sikha Pentyala

TL;DR

FHAIM tackles input privacy in outsourced synthetic data generation by training a marginal-based generator directly on encrypted tabular data using a CKKS-based fully homomorphic encryption scheme. It realizes DP-in-FHE through novel protocols for marginal computation, encrypted DP noise injection, and encrypted query selection, employing a squared $L_2$ quality score to stabilize the selection step. Empirical results on real datasets demonstrate that FHAIM preserves the utility of the AIM baseline under $(\varepsilon,\delta)$-DP while delivering practical runtimes (approximately 11–30 minutes) and modest memory usage, thereby enabling private SDG as a service in privacy-sensitive domains. This work shows that input privacy and formal DP guarantees can be achieved without multi-party coordination, paving the way for scalable privacy-preserving data sharing in healthcare, finance, and beyond.

Abstract

Data is the lifeblood of AI, yet much of the most valuable data remains locked in silos due to privacy concerns and regulations. As a result, AI remains heavily underutilized in many of the most important domains, including healthcare, education, and finance. Synthetic data generation (SDG), i.e., the generation of artificial data with a synthesizer trained on real data, offers an appealing way to make data available while mitigating privacy concerns. However, existing SDG-as-a-service workflows require data holders to trust providers with access to private data. We propose FHAIM, the first fully homomorphic encryption (FHE) framework for training a marginal-based synthetic data generator on encrypted tabular data. FHAIM adapts the widely used AIM algorithm to the FHE setting using novel FHE protocols, ensuring that the private data remains encrypted throughout and is released only with differential privacy guarantees. Our empirical analysis shows that FHAIM preserves the performance of AIM while maintaining feasible runtimes.

Paper Structure

This paper contains 23 sections, 2 theorems, 9 equations, 4 figures, 4 tables, 4 algorithms.

Key Result

Theorem D.1

Let $q_w(D)$ be a marginal and $\hat{q}_w$ be the estimated marginal from the graphical model. Define the quality score as $s(w,D) = \alpha_w \left(\left\| q_w(D) - q_w(\hat{D}) \right\|_2^2 - \rho\right)$, where … Under unbounded differential privacy, the global sensitivity of $s(w,\cdot)$ is $\Delta s(w \ldots$
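
To make the quality score concrete, the following is a minimal plaintext sketch of the formula $s(w,D) = \alpha_w(\|q_w(D) - q_w(\hat{D})\|_2^2 - \rho)$ in Python. It is not the encrypted protocol from the paper. The function name `squared_l2_score` and the example numbers are hypothetical, and since the definition of $\rho$ is cut off in the statement above, the sketch treats both $\alpha_w$ and $\rho$ as plain numeric inputs.

```python
def squared_l2_score(true_marginal, est_marginal, alpha, rho):
    """Return alpha * (||q - q_hat||_2^2 - rho) for two equal-length count vectors.

    alpha and rho are taken as given numbers (assumption: their exact
    definitions are not available in this text).
    """
    if len(true_marginal) != len(est_marginal):
        raise ValueError("marginals must have the same number of cells")
    # The squared L2 distance needs only subtractions, multiplications and
    # additions, i.e. it is a degree-2 polynomial in the inputs.
    sq_dist = sum((a - b) ** 2 for a, b in zip(true_marginal, est_marginal))
    return alpha * (sq_dist - rho)


# Hypothetical flattened marginal with four cells.
q_true = [30, 10, 5, 55]
q_est = [25, 15, 10, 50]
score = squared_l2_score(q_true, q_est, alpha=1.0, rho=20.0)
print(score)
```

Because the score is a low-degree polynomial, it avoids the absolute value that an $L_1$ distance would require, which is consistent with the contrast drawn in Figure 2 between the approximated $L_1$ norm and the exact squared $L_2$ norm.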

Figures (4)

  • Figure 1: Overview of $\texttt{FHAIM}$. (Left) system entities; (right) end-to-end workflow.
  • Figure 2: Utility Comparison. Classification accuracy across three datasets. The results isolate the impact of polynomial approximation errors in the $L_1$ norm (blue bars) versus the stability of the exact squared $L_2$ norm (green bars). Red dashed lines indicate real data performance.
  • Figure 3: Runtime Analysis. Breakdown of execution times (in seconds) for the Compute, Select, and Measure protocols across datasets. The one-time Compute phase (left) dominates the total runtime and is invariant to the choice of norm. In the iterative Select (center) and Measure (right) phases, the exact Squared $L_2$ protocol (red) incurs a modest computational overhead compared to the degree-10 $L_1$ approximation (blue), reflecting the trade-off between the raw speed of low-degree approximations and the numerical stability of the exact squared norm.
  • Figure 4: Illustration of 2-way marginal computation
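
As a plaintext illustration of what a 2-way marginal is, the sketch below computes the contingency table of two attributes in two ways: by direct counting, and as a sum of outer products of one-hot vectors. The second form uses only multiplications and additions, which is the kind of arithmetic a CKKS-style scheme evaluates natively. Whether the protocol in the paper uses this exact encoding is an assumption here, and the attribute names, domains, and rows are hypothetical.

```python
from collections import Counter


def two_way_marginal(rows, attr_a, attr_b, domain_a, domain_b):
    """Count co-occurrences of (attr_a, attr_b) values as a 2-D table."""
    counts = Counter((r[attr_a], r[attr_b]) for r in rows)
    return [[counts[(a, b)] for b in domain_b] for a in domain_a]


def two_way_marginal_onehot(rows, attr_a, attr_b, domain_a, domain_b):
    """Same table, computed as a sum of outer products of one-hot vectors."""
    table = [[0] * len(domain_b) for _ in domain_a]
    for r in rows:
        onehot_a = [1 if r[attr_a] == a else 0 for a in domain_a]
        onehot_b = [1 if r[attr_b] == b else 0 for b in domain_b]
        for i, x in enumerate(onehot_a):
            for j, y in enumerate(onehot_b):
                table[i][j] += x * y  # only multiply and add
    return table


# Hypothetical toy dataset.
rows = [
    {"sex": "F", "smoker": "yes"},
    {"sex": "F", "smoker": "no"},
    {"sex": "M", "smoker": "no"},
    {"sex": "M", "smoker": "no"},
    {"sex": "F", "smoker": "yes"},
]
sex_domain = ["F", "M"]
smoker_domain = ["yes", "no"]

counted = two_way_marginal(rows, "sex", "smoker", sex_domain, smoker_domain)
onehot = two_way_marginal_onehot(rows, "sex", "smoker", sex_domain, smoker_domain)
print(counted)  # [[2, 1], [0, 2]]
```

Both functions return the same table; the entries sum to the number of rows.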

Theorems & Definitions (6)

  • Definition 3.1: Gaussian Mechanism
  • Definition 3.2: Exponential Mechanism
  • Theorem D.1: Sensitivity of Squared $L_2$ Quality Score
  • proof
  • Theorem D.2: Expected Squared $L_2$ Penalty
  • proof
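
For readers unfamiliar with the two mechanisms named in Definitions 3.1 and 3.2, here is a minimal plaintext sketch using only the Python standard library. It follows the textbook forms: the Gaussian mechanism adds independent $\mathcal{N}(0, \sigma^2)$ noise to each value, and the exponential mechanism samples a candidate with probability proportional to $\exp(\varepsilon \cdot s / (2\Delta))$. The paper's own definitions and noise calibration may differ, so `sigma`, `epsilon`, and `sensitivity` are left as caller-supplied parameters, and the candidate names and scores are hypothetical. It is also not cryptographically secure sampling and does not reflect how noise is injected under encryption.

```python
import math
import random


def gaussian_mechanism(values, sigma, rng):
    """Add independent N(0, sigma^2) noise to each value."""
    return [v + rng.gauss(0.0, sigma) for v in values]


def exponential_mechanism(scores, epsilon, sensitivity, rng):
    """Sample a key of `scores` with probability proportional to
    exp(epsilon * score / (2 * sensitivity))."""
    scale = epsilon / (2.0 * sensitivity)
    top = max(scores.values())
    candidates = list(scores)
    # Subtracting the maximum score avoids overflow and leaves the
    # sampling distribution unchanged.
    weights = [math.exp(scale * (scores[c] - top)) for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed for reproducibility of the sketch only

# Hypothetical marginal cells and candidate query scores.
noisy_marginal = gaussian_mechanism([2, 1, 0, 2], sigma=1.0, rng=rng)
query_scores = {"age,sex": 80.0, "sex,smoker": 5.0, "age,smoker": 12.0}
chosen = exponential_mechanism(query_scores, epsilon=1.0, sensitivity=1.0, rng=rng)
print(noisy_marginal, chosen)
```

In a select-then-measure loop of the kind the Select and Measure protocols in Figure 3 suggest, the exponential mechanism would pick which marginal to measure and the Gaussian mechanism would perturb the measured marginal.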