Differentially Private Neural Tangent Kernels for Privacy-Preserving Data Generation
Yilin Yang, Kamil Adamczewski, Danica J. Sutherland, Xiaoxiao Li, Mijung Park
TL;DR
This work introduces DP-NTK, a practical framework for differentially private data generation that uses finite-dimensional empirical Neural Tangent Kernel (e-NTK) features to define a kernel mean embedding and a maximum mean discrepancy (MMD) objective. The mean embedding of the private data is privatized with the Gaussian mechanism, and a generator is then trained to minimize the privatized MMD; in this way DP-NTK achieves strong privacy guarantees while maintaining high utility, even without public data. Theoretical analysis shows that the private minimizer closely tracks the non-private optimum, with favorable rates, and empirical results on MNIST, FashionMNIST, CelebA, CIFAR-10, and tabular datasets show performance competitive with or superior to state-of-the-art private generators. The approach offers a scalable, data-efficient route to privacy-preserving data synthesis across vision and tabular domains.
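The pipeline summarized above can be illustrated with a minimal sketch in plain Python. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy two-layer ReLU network, the function names (`entk_features`, `mean_embedding`, `privatize`, `squared_mmd`), the replace-one notion of neighbouring datasets (which gives an L2 sensitivity of 2/m for unit-norm features), and the uncalibrated `noise_multiplier`. Label conditioning and the generator's optimization loop are omitted.

```python
import math
import random


def entk_features(x, W, a):
    """e-NTK feature map of a toy ReLU network at fixed (untrained) weights.

    The network is f(x) = a . relu(W x) / sqrt(h); the feature vector is the
    gradient of f with respect to all parameters (a, W), scaled to unit norm.
    """
    h = len(a)
    scale = 1.0 / math.sqrt(h)
    pre = [sum(w * xj for w, xj in zip(row, x)) for row in W]  # W x
    grad_a = [scale * max(p, 0.0) for p in pre]                # df/da_i
    grad_W = [scale * ai * xj if p > 0 else 0.0                # df/dW_ij
              for ai, p in zip(a, pre) for xj in x]
    phi = grad_a + grad_W
    norm = math.sqrt(sum(v * v for v in phi))
    return [v / norm for v in phi] if norm > 0 else phi


def mean_embedding(points, W, a):
    """Average the feature vectors of a dataset (the kernel mean embedding)."""
    feats = [entk_features(x, W, a) for x in points]
    return [sum(col) / len(feats) for col in zip(*feats)]


def privatize(mu, m, noise_multiplier, rng):
    """Gaussian mechanism on the mean embedding of m private points.

    Assumption: replace-one neighbouring datasets and features with norm <= 1,
    so the L2 sensitivity of the mean is at most 2 / m. Mapping
    noise_multiplier to a concrete (epsilon, delta) is left to an accountant.
    """
    sigma = noise_multiplier * 2.0 / m
    return [v + rng.gauss(0.0, sigma) for v in mu]


def squared_mmd(mu_p, mu_q):
    """Squared MMD between two distributions given their mean embeddings."""
    return sum((p - q) ** 2 for p, q in zip(mu_p, mu_q))


# Toy data standing in for private data and generator samples.
rng = random.Random(0)
d, h, m = 3, 8, 200
W = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(h)]
a = [rng.gauss(0.0, 1.0) for _ in range(h)]
real = [[rng.gauss(1.0, 1.0) for _ in range(d)] for _ in range(m)]
fake = [[rng.gauss(-1.0, 1.0) for _ in range(d)] for _ in range(m)]

# The private data is touched exactly once, here.
mu_private = privatize(mean_embedding(real, W, a), m, noise_multiplier=1.0, rng=rng)

# This loss can be re-evaluated for any batch of synthetic samples.
loss = squared_mmd(mu_private, mean_embedding(fake, W, a))
```

In a full implementation the `fake` samples would come from a generator whose parameters are updated by gradient descent on `loss`, while `mu_private` stays fixed throughout training.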
Abstract
Maximum mean discrepancy (MMD) is a particularly useful distance metric for differentially private data generation: when used with finite-dimensional features, it allows us to summarize and privatize the data distribution once, and then reuse that summary repeatedly during generator training without further privacy loss. An important question in this framework is, then, what features are useful for distinguishing between real and synthetic data distributions, and whether those features enable us to generate high-quality synthetic data. This work considers using the features of $\textit{neural tangent kernels (NTKs)}$, more precisely $\textit{empirical}$ NTKs (e-NTKs). We find that, perhaps surprisingly, the expressiveness of the untrained e-NTK features is comparable to that of perceptual features obtained by pre-training on public data. As a result, our method improves the privacy-accuracy trade-off compared to other state-of-the-art methods, without relying on any public data, as demonstrated on several tabular and image benchmark datasets.
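The "summarize once, reuse without further privacy loss" argument can be written out as follows. The notation is assumed here for illustration and does not come from the abstract: a feature map $\phi$ of dimension $D$, private samples $x_1, \dots, x_m$, a generator $g_\theta$ with latent inputs $z_1, \dots, z_n$, and a noise scale $\sigma$. With a kernel of the form $k(x, x') = \phi(x)^\top \phi(x')$, the empirical squared MMD is a distance between two mean feature vectors:

$$
\widehat{\mathrm{MMD}}^2(P, Q_\theta) = \Big\| \hat\mu_P - \hat\mu_{Q_\theta} \Big\|_2^2,
\qquad
\hat\mu_P = \frac{1}{m} \sum_{i=1}^{m} \phi(x_i),
\qquad
\hat\mu_{Q_\theta} = \frac{1}{n} \sum_{j=1}^{n} \phi\big(g_\theta(z_j)\big).
$$

Only $\hat\mu_P$ depends on the private data, so it is the only quantity that needs noise:

$$
\tilde\mu_P = \hat\mu_P + \mathcal{N}\big(0, \sigma^2 I_D\big).
$$

Every later evaluation of $\| \tilde\mu_P - \hat\mu_{Q_\theta} \|_2^2$, and every gradient step on $\theta$, is a function of the already released $\tilde\mu_P$ and of synthetic samples only. By the post-processing property of differential privacy, these steps consume no additional privacy budget.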
