Differentially Private Wasserstein Barycenters
Anming Gu, Sasidhar Kunapuli, Mark Bun, Edward Chien, Kristjan Greenewald
TL;DR
The paper addresses computing Wasserstein barycenters under differential privacy (DP) in the central model, where each individual contributes a single datapoint to one empirical distribution. It introduces two DP frameworks: (i) a private coreset approach that reduces the problem to private coresets for the Wasserstein distance, combined with Johnson-Lindenstrauss (JL) dimensionality reduction, yielding an $\varepsilon$-DP, $(z(1+\gamma), \tilde{O}_{p,\gamma,z,\xi}((1/(\varepsilon n)^{1/d}) + t))$-approximate barycenter; and (ii) an output perturbation approach that adds Gaussian noise to the computed barycenter, achieving an $(\varepsilon,\delta)$-DP guarantee with an error bound of $\tilde{O}_{p,\gamma,z,\xi}((md\log(1/\delta))/(\varepsilon k)^2)^{p/2}$ plus an approximation term, with improvements for clustered data via distribution splitting. The authors provide theoretical guarantees and report strong empirical results on synthetic data, MNIST, and large-scale U.S. population datasets, examining the privacy-utility tradeoff and the role of JL projection and subsampling. The work enables DP Wasserstein barycenters for privacy-preserving data analysis and synthetic data generation, and points to future directions such as reducing the dimensionality challenges and exploring alternative privacy models.
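To make the output perturbation idea in (ii) concrete, the following is a minimal sketch of the generic Gaussian mechanism applied to the support points of an already computed barycenter. It is not the authors' algorithm: the function names, the `sensitivity` argument (an assumed $\ell_2$ sensitivity of the barycenter support, which the paper's analysis would have to supply), and the use of the classical noise calibration $\sigma = \Delta_2 \sqrt{2\ln(1.25/\delta)}/\varepsilon$ (valid for $0 < \varepsilon < 1$) are all assumptions made for illustration.

```python
import math
import numpy as np


def gaussian_sigma(sensitivity, epsilon, delta):
    """Noise scale of the classical Gaussian mechanism.

    `sensitivity` is an assumed L2 sensitivity of the released quantity;
    this calibration is only valid for 0 < epsilon < 1 and 0 < delta < 1.
    """
    if not (0.0 < epsilon < 1.0) or not (0.0 < delta < 1.0):
        raise ValueError("classical calibration needs 0 < epsilon, delta < 1")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def perturb_barycenter(support, sensitivity, epsilon, delta, rng=None):
    """Add i.i.d. Gaussian noise to every coordinate of the support.

    `support` is an (m, d) array of barycenter support points computed
    non-privately by any solver (hypothetical input, not computed here).
    """
    rng = np.random.default_rng() if rng is None else rng
    sigma = gaussian_sigma(sensitivity, epsilon, delta)
    return support + rng.normal(0.0, sigma, size=support.shape)


if __name__ == "__main__":
    # Placeholder barycenter with m = 5 support points in d = 2 dimensions.
    support = np.zeros((5, 2))
    noisy = perturb_barycenter(support, sensitivity=0.1, epsilon=0.5,
                               delta=1e-5, rng=np.random.default_rng(0))
    print(noisy.shape)
```

The sketch separates noise calibration from the barycenter solver, so any non-private solver could be plugged in; the privacy guarantee holds only if the supplied sensitivity is a valid bound for that solver's output.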
Abstract
The Wasserstein barycenter is defined as the mean of a set of probability measures under the optimal transport metric, and has numerous applications spanning machine learning, statistics, and computer graphics. In practice these input measures are empirical distributions built from sensitive datasets, motivating a differentially private (DP) treatment. We present, to our knowledge, the first algorithms for computing Wasserstein barycenters under differential privacy. Empirically, on synthetic data, MNIST, and large-scale U.S. population datasets, our methods produce high-quality private barycenters with strong accuracy-privacy tradeoffs.
