Differentially Private Neural Tangent Kernels for Privacy-Preserving Data Generation

Yilin Yang, Kamil Adamczewski, Danica J. Sutherland, Xiaoxiao Li, Mijung Park

TL;DR

This work introduces DP-NTK, a practical framework for differential privacy in data generation that uses finite-dimensional empirical Neural Tangent Kernel features within a kernel mean embedding and MMD objective. By privatizing the mean embedding with the Gaussian mechanism and training a generator to minimize the privatized MMD, DP-NTK achieves strong privacy guarantees while maintaining high utility, even without public data. Theoretical analysis shows that the private minimizer closely tracks the non-private optimum, with favorable rates, and empirical results across MNIST, FashionMNIST, CelebA, CIFAR-10, and tabular datasets demonstrate competitive or superior performance relative to state-of-the-art private generators. The approach offers a scalable, data-efficient pathway for privacy-preserving data synthesis with broad applicability across vision and tabular domains.

Abstract

Maximum mean discrepancy (MMD) is a particularly useful distance metric for differentially private data generation: when used with finite-dimensional features, it allows us to summarize and privatize the data distribution once, and then reuse that summary throughout generator training without further privacy loss. An important question in this framework is, then, which features are useful for distinguishing between real and synthetic data distributions, and whether those features enable us to generate quality synthetic data. This work considers using the features of $\textit{neural tangent kernels (NTKs)}$, more precisely $\textit{empirical}$ NTKs (e-NTKs). We find that, perhaps surprisingly, the expressiveness of the untrained e-NTK features is comparable to that of perceptual features taken from networks pre-trained on public data. As a result, our method improves the privacy-accuracy trade-off compared to other state-of-the-art methods, without relying on any public data, as demonstrated on several tabular and image benchmark datasets.
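
To make the "summarize once, reuse many times" idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a tiny two-layer ReLU network $f(x) = a^\top \mathrm{relu}(Wx) / \sqrt{w}$ at random initialization and takes as e-NTK features the gradient of $f$ with respect to the first-layer weights only (a simplification), scaled to unit norm. The names `entk_features`, `mean_embedding`, and `mmd2` are hypothetical helpers introduced for illustration.

```python
import numpy as np

def entk_features(x, W, a):
    """Unit-norm e-NTK features of f(x) = a . relu(W x) / sqrt(width).

    Assumption: only the gradient w.r.t. the first-layer weights W is used.
    """
    width = W.shape[0]
    pre = x @ W.T                        # (n, width) pre-activations
    gate = (pre > 0).astype(x.dtype)     # derivative of ReLU
    # d f / d W[j, :] = a[j] * 1[w_j . x > 0] * x / sqrt(width)
    grads = (gate * a)[:, :, None] * x[:, None, :] / np.sqrt(width)
    feats = grads.reshape(x.shape[0], -1)            # (n, width * d)
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    return feats / np.maximum(norms, 1e-12)

def mean_embedding(feats):
    """Finite-dimensional kernel mean embedding: the average feature vector."""
    return feats.mean(axis=0)

def mmd2(mu_p, mu_q):
    """Squared MMD between two distributions given their mean embeddings."""
    diff = mu_p - mu_q
    return float(diff @ diff)

rng = np.random.default_rng(0)
d, width = 5, 64
W = rng.standard_normal((width, d))          # untrained (random) weights
a = rng.choice([-1.0, 1.0], size=width)

real = rng.standard_normal((200, d)) + 1.0
fake_close = rng.standard_normal((200, d)) + 1.0   # same distribution as real
fake_far = rng.standard_normal((200, d)) - 1.0     # shifted distribution

# The data summary is computed a single time ...
mu_real = mean_embedding(entk_features(real, W, a))

# ... and reused for every comparison against synthetic samples.
mmd_close = mmd2(mu_real, mean_embedding(entk_features(fake_close, W, a)))
mmd_far = mmd2(mu_real, mean_embedding(entk_features(fake_far, W, a)))
print(mmd_close, mmd_far)
```

Because the real data enters only through `mu_real`, privatizing that one vector is enough: each later evaluation of the objective touches only synthetic samples.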

Paper Structure

This paper contains 15 sections, 2 theorems, 13 equations, 4 figures, 5 tables, and 1 algorithm.

Key Result

Proposition 3.1

The global sensitivity of the mean embedding $\bm{\mu}_P$ is $\Delta_{\bm{\mu}_P} = 2 / m$.
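
One way to see where the value $2/m$ comes from, assuming $m$ is the number of records and every feature vector has unit norm: replacing a single record $x$ by $x'$ changes the mean embedding by $\frac{1}{m}\|\phi(x) - \phi(x')\| \le \frac{1}{m}(\|\phi(x)\| + \|\phi(x')\|) = 2/m$. The sketch below checks this bound numerically and applies the Gaussian mechanism. It is an illustration under stated assumptions: `unit_features` is a hypothetical stand-in feature map (not the e-NTK map), and `noise_multiplier` is taken as given, since converting a target $(\epsilon, \delta)$ into a noise scale is left to a privacy accountant.

```python
import numpy as np

def unit_features(x):
    """Hypothetical stand-in feature map: scale each row to unit L2 norm."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)

def privatize_mean_embedding(feats, noise_multiplier, rng):
    """Gaussian mechanism on the mean embedding of unit-norm features.

    Noise standard deviation = noise_multiplier * sensitivity, with
    sensitivity 2 / m as in Proposition 3.1.
    """
    m = feats.shape[0]
    sensitivity = 2.0 / m
    mu = feats.mean(axis=0)
    noise = rng.normal(0.0, noise_multiplier * sensitivity, size=mu.shape)
    return mu + noise, sensitivity

rng = np.random.default_rng(0)
m, d = 1000, 20
data = rng.standard_normal((m, d))
feats = unit_features(data)

# Neighbouring dataset: one record replaced by its negation, which maps to
# the opposite unit vector and therefore attains the worst-case change.
neighbour = data.copy()
neighbour[0] = -data[0]
gap = np.linalg.norm(feats.mean(axis=0) - unit_features(neighbour).mean(axis=0))

mu_private, sensitivity = privatize_mean_embedding(
    feats, noise_multiplier=1.0, rng=rng
)
print(gap, sensitivity)
```

The noise scale shrinks as $1/m$, so for a fixed noise multiplier the privatized embedding becomes more accurate as the dataset grows.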

Figures (4)

  • Figure 1: Generated samples of MNIST and FashionMNIST from DP-NTK with different widths $w$; all samples use the same DP noise level ($\epsilon=10$, $\delta = 10^{-5}$).
  • Figure 2: DP-NTK under different DP levels (left) and comparison with other models (right) for MNIST and FashionMNIST.
  • Figure 3: Synthetic $32 \times 32$ CelebA samples generated at different levels of privacy. Samples for DP-MERF and DP-Sinkhorn are taken from prior work. Our method yields samples of higher visual quality than the comparison methods. The FID is 75 for the proposed method, 189 for DP-Sinkhorn, and 274 for DP-MERF.
  • Figure 4: The generated and real images for the CIFAR-10 dataset. The FID scores for the proposed method are 104 ($\epsilon=\infty$) and 107 ($\epsilon=10$), respectively. For DP-MERF, they are 127 ($\epsilon=\infty$) and 141 ($\epsilon=10$).

Theorems & Definitions (3)

  • Proposition 3.1
  • proof
  • Proposition 4.1