Differentially Private Wasserstein Barycenters

Anming Gu, Sasidhar Kunapuli, Mark Bun, Edward Chien, Kristjan Greenewald

TL;DR

The paper addresses computing Wasserstein barycenters under differential privacy in the central model, where each individual contributes a single datapoint to one empirical distribution. It introduces two DP frameworks: (i) a private coreset approach that reduces the problem to private coresets for the Wasserstein distance, combined with Johnson-Lindenstrauss (JL) dimensionality reduction, yielding an $\varepsilon$-DP, $(z(1+\gamma), \tilde{O}_{p,\gamma,z,\xi}((1/(\varepsilon n)^{1/d}) + t))$-approximate barycenter; and (ii) an output perturbation approach that adds Gaussian noise to the computed barycenter, achieving an $(\varepsilon,\delta)$-DP guarantee with an error bound of $\tilde{O}_{p,\gamma,z,\xi}((md\log(1/\delta))/(\varepsilon k)^2)^{p/2}$ plus an approximation term, with improvements under clustered data via distribution splitting. The authors provide theoretical guarantees and report experiments on synthetic data, MNIST, and large-scale U.S. population datasets that illustrate the privacy-utility tradeoffs and the role of JL projection and subsampling. They also suggest directions for mitigating dimensionality challenges and exploring alternative privacy models, with implications for privacy-preserving data analysis and synthetic data generation.
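The output perturbation idea in (ii) can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes one-dimensional, equal-size inputs (where the 2-Wasserstein barycenter is the average of the sorted samples), it uses the classical Gaussian-mechanism calibration, which is valid only for $0 < \varepsilon < 1$, and the value `l2_sensitivity=0.05` is a hypothetical placeholder rather than a bound derived in the paper. Without a proven sensitivity bound, the output carries no DP guarantee. All function names are illustrative.

```python
import numpy as np


def barycenter_1d(samples):
    """W2 barycenter of m equal-size 1D empirical distributions (uniform weights).

    In 1D the barycenter's quantile function is the average of the inputs'
    quantile functions, so we average the sorted samples position by position.
    """
    sorted_rows = np.sort(np.asarray(samples, dtype=float), axis=1)
    return sorted_rows.mean(axis=0)


def gaussian_sigma(l2_sensitivity, eps, delta):
    """Classical Gaussian-mechanism noise scale (requires 0 < eps < 1)."""
    if not (0 < eps < 1 and 0 < delta < 1):
        raise ValueError("classical calibration needs 0 < eps < 1 and 0 < delta < 1")
    return l2_sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps


def perturb_output(support, l2_sensitivity, eps, delta, rng):
    """Add i.i.d. Gaussian noise to every coordinate of the support points."""
    sigma = gaussian_sigma(l2_sensitivity, eps, delta)
    return support + rng.normal(0.0, sigma, size=support.shape)


rng = np.random.default_rng(0)
# Three input distributions (means 0, 2, 4), 500 samples each.
data = rng.normal(loc=[[0.0], [2.0], [4.0]], scale=1.0, size=(3, 500))
bary = barycenter_1d(data)

# ASSUMPTION: 0.05 is a made-up L2 sensitivity, used only for illustration.
private_bary = perturb_output(bary, l2_sensitivity=0.05, eps=0.5, delta=1e-5, rng=rng)
```

The noise scale grows linearly with the assumed sensitivity and with $1/\varepsilon$, which is why a tight sensitivity analysis of the barycenter output matters for utility.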

Abstract

The Wasserstein barycenter is defined as the mean of a set of probability measures under the optimal transport metric, and has numerous applications spanning machine learning, statistics, and computer graphics. In practice these input measures are empirical distributions built from sensitive datasets, motivating a differentially private (DP) treatment. We present, to our knowledge, the first algorithms for computing Wasserstein barycenters under differential privacy. Empirically, on synthetic data, MNIST, and large-scale U.S. population datasets, our methods produce high-quality private barycenters with strong accuracy-privacy tradeoffs.

Paper Structure

This paper contains 28 sections, 30 theorems, 55 equations, 9 figures, 5 algorithms.

Key Result

Proposition 4.1

If $W_p(\mu,\mu')\le t$, then $\mu'$ is a $(p, 1, t)$-coreset of $\mu$ for the $p$-Wasserstein distance problem.
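A small numerical check can make this proposition concrete. The coreset definition is not reproduced in this excerpt, so the sketch below assumes that a $(p, 1, t)$-coreset means $W_p(\mu', \nu)$ matches $W_p(\mu, \nu)$ up to multiplicative factor 1 and additive error $t$ for every $\nu$. Under that reading the claim follows from the triangle inequality, $|W_p(\mu,\nu) - W_p(\mu',\nu)| \le W_p(\mu,\mu') \le t$. The example is restricted to equal-size 1D empirical distributions, where sorting gives the optimal coupling; the helper name `wasserstein_1d` is illustrative.

```python
import numpy as np


def wasserstein_1d(x, y, p):
    """p-Wasserstein distance between two equal-size 1D empirical distributions.

    With uniform weights and equal sizes, the optimal coupling in 1D matches
    the sorted points in order.
    """
    x, y = np.sort(x), np.sort(y)
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))


rng = np.random.default_rng(1)
p = 2
mu = rng.normal(size=400)
mu_prime = mu + rng.uniform(-0.1, 0.1, size=400)  # perturbed copy of mu
t = wasserstein_1d(mu, mu_prime, p)

# Compare distances to many random test distributions nu.
gaps = []
for _ in range(100):
    nu = rng.normal(loc=rng.uniform(-3, 3), scale=rng.uniform(0.5, 2), size=400)
    gaps.append(abs(wasserstein_1d(mu, nu, p) - wasserstein_1d(mu_prime, nu, p)))
max_gap = max(gaps)  # never exceeds t, by the triangle inequality
```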

Figures (9)

  • Figure 1: Example of a solution. The input distributions are $\mu_a := \frac{1}{2}\delta_{a_1} + \frac{1}{2}\delta_{a_2}, \mu_b := \frac{1}{2}\delta_{b_1} + \frac{1}{2}\delta_{b_2}, \mu_c := \frac{1}{2}\delta_{c_1} + \frac{1}{2}\delta_{c_2}$ and the candidate barycenter is $\nu := \frac{1}{2}\delta_{\nu^{(1)}} + \frac{1}{2}\delta_{\nu^{(2)}}$. Observe that: $S_1 = \{a_1, b_1, c_1\}, S_2 = \{a_2, b_2, c_2\}, w_1(a_1) = w_1(b_1) = w_1(c_1)= w_2(a_2) = w_2(b_2) = w_2(c_2) = 1$.
  • Figure 2: Synthetic experiments testing sample size $n$, privacy parameter $\varepsilon$, and projection dimension $d'$, averaged over 30 runs, for the private coreset approach (source label: sec:coreset).
  • Figure 3: Barycenters on continental US populations.
  • Figure 4: Results for $n = 2000, 4000, \dots, 128000$ with $\varepsilon = 1$ and $\delta = \frac{1}{n}$, in the same experimental setup as the figure labeled fig:us_k=1, averaged over 10 trials. Left: cost in squared degrees. Right: the 2-Wasserstein distance between the private and non-private barycenters, in degrees.
  • Figure 5: Unperturbed data is uniform over $\mathbb{S}^1$. Here, the averages of any two disjoint half-arcs yield an optimal barycenter. However, with a bad initialization, each point in the support of the output distribution can move $\Omega(1)$ as $\Omega(n)$ of the couplings change.
  • ...and 4 more figures

Theorems & Definitions (59)

  • Definition 1: $(\varepsilon,\delta)$-DP
  • Definition 2: Wasserstein distance
  • Definition 3: Wasserstein barycenter
  • Definition 4: Neighboring datasets
  • Definition 5: Solution
  • Definition 6: Cost
  • Definition 7: Approximate Wasserstein barycenter
  • Definition 8: Coreset for Wasserstein distance
  • Proposition 4.1
  • Theorem 4.2
  • ...and 49 more