Private Wasserstein Distance

Wenqian Li, Yan Pang

TL;DR

The paper tackles the challenge of estimating the $2$-Wasserstein distance $\mathcal{W}_2(\mu,\nu)$ across privacy-sensitive distributed datasets without sharing raw data. It introduces TriangleWad, a privacy-preserving framework that leverages Wasserstein geodesics with a private Gaussian reference to bound $\mathcal{W}_2(\mu,\nu)$ in one round of interaction, avoiding direct interpolation between the raw distributions and keeping the transport plans hidden. Theoretical results establish a bounded approximation error tied to the reference distribution $\gamma$ and push-forward parameter $t$, along with complexity and attack-defense analyses. Empirically, TriangleWad delivers competitive accuracy with reduced computation on image and text tasks, and it enables broader applications in labeled-data distances, data valuation, noisy-data detection, and multi-source distance computation, with demonstrated utility in federated learning (FL) and data marketplaces. The work offers a practical, privacy-conscious mechanism for distributional similarity assessment that scales to real-world privacy-sensitive settings.
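
For orientation, the standard definition of the $2$-Wasserstein distance and the triangle bounds that any reference measure $\gamma$ provides are written out here. These are textbook facts about the Wasserstein metric, not the paper's specific estimator or error bound; $\Pi(\mu,\nu)$ denotes the set of couplings of $\mu$ and $\nu$, a symbol introduced for this note.

$$
\mathcal{W}_2(\mu,\nu) = \left( \min_{\pi \in \Pi(\mu,\nu)} \int \lVert x - y \rVert^2 \, d\pi(x,y) \right)^{1/2}
$$

$$
\left| \mathcal{W}_2(\mu,\gamma) - \mathcal{W}_2(\nu,\gamma) \right| \;\le\; \mathcal{W}_2(\mu,\nu) \;\le\; \mathcal{W}_2(\mu,\gamma) + \mathcal{W}_2(\gamma,\nu)
$$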

Abstract

Wasserstein distance is a key metric for quantifying data divergence from a distributional perspective. However, its application in privacy-sensitive environments, where direct sharing of raw data is prohibited, presents significant challenges. Existing approaches, such as Differential Privacy and Federated Optimization, have been employed to estimate the Wasserstein distance under such constraints. However, these methods often fall short when both accuracy and security are required. In this study, we explore the inherent triangular properties within the Wasserstein space, leading to a novel solution named TriangleWad. This approach facilitates the fast computation of the Wasserstein distance between datasets stored across different entities, ensuring that raw data remain completely hidden. TriangleWad not only strengthens resistance to potential attacks but also preserves high estimation accuracy. Through extensive experiments across various tasks involving both image and text data, we demonstrate its superior performance and significant potential for real-world applications.
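
The triangular structure mentioned above can be checked numerically in a toy setting. The sketch below is not TriangleWad itself: it assumes one-dimensional samples of equal size with uniform weights, where sorted matching gives the exact optimal transport plan. The helper names `w2_1d` and `geodesic_point`, the sample sizes, and the Gaussian parameters are all hypothetical choices made for illustration.

```python
import numpy as np


def w2_1d(x, y):
    """Exact W2 between equal-size, uniformly weighted 1-D samples.

    In one dimension the optimal plan matches sorted points, so no
    general OT solver is needed (assumption: len(x) == len(y)).
    """
    xs, ys = np.sort(x), np.sort(y)
    return float(np.sqrt(np.mean((xs - ys) ** 2)))


def geodesic_point(x, ref, t):
    """Interpolating measure at time t on the W2 geodesic from x to ref."""
    return (1.0 - t) * np.sort(x) + t * np.sort(ref)


rng = np.random.default_rng(0)
mu = rng.normal(0.0, 1.0, size=200)     # data held by party A
nu = rng.normal(2.0, 1.0, size=200)     # data held by party B
gamma = rng.normal(1.0, 1.5, size=200)  # hypothetical Gaussian reference

d_mu_nu = w2_1d(mu, nu)
d_mu_gamma = w2_1d(mu, gamma)
d_nu_gamma = w2_1d(nu, gamma)

# Triangle bounds on W2(mu, nu) obtained through the reference gamma.
lower = abs(d_mu_gamma - d_nu_gamma)
upper = d_mu_gamma + d_nu_gamma

# Geodesic property: the interpolating measure sits at distance t * W2(mu, gamma).
t = 0.3
eta_mu = geodesic_point(mu, gamma, t)
d_mu_eta = w2_1d(mu, eta_mu)

print(f"{lower:.4f} <= {d_mu_nu:.4f} <= {upper:.4f}")
print(f"W2(mu, eta_mu) = {d_mu_eta:.4f}, t * W2(mu, gamma) = {t * d_mu_gamma:.4f}")
```

In higher dimensions the sorted matching would have to be replaced by a general optimal transport solver; the two properties checked here (triangle bounds and linear scaling of distance along a geodesic) hold in any dimension.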

Paper Structure (43 sections, 6 theorems, 40 equations, 12 figures, 4 tables)

This paper contains 43 sections, 6 theorems, 40 equations, 12 figures, 4 tables.

Key Result

Theorem 3.1

Suppose $\gamma \in \mathbb{R}^{k\times d} \sim \mathcal{N}(\mu_\gamma,\sigma^2_\gamma)$. Let $\pi^\star(\mu,\gamma)\in\mathbb{R}^{m\times k}$ be the OT plan between $\mu$ and $\gamma$, and let $\pi^\star(\nu,\gamma)\in\mathbb{R}^{n\times k}$ be the OT plan between $\nu$ and $\gamma$. If $\eta_\mu$ and $\eta_\nu$ ... with the condition that both measures have the same push parameters, e.g. $t=s$, then the approximation ...

Figures (12)

  • Figure 1: Left: The blue line shows the distance between the constructed attack data ("Attack") and the target data ("P"). The attack data gradually converge to the target data until the two distributions are identical; Middle: interpolating measure $\xi$ in FedWad; Right: Clean image extracted from $\xi$ via the distributional attack.
  • Figure 2: Technical Comparison: In the previous work of Rakotomamonjy et al. [rakotomamonjyfederated], two Wasserstein balls $\mathcal{B}(\mu,\mathcal{W}(\mu,\xi))$ and $\mathcal{B}(\mu,\mathcal{W}(\mu,\nu))$, together with the condition that $\mu,\xi,\nu$ lie on the same geodesic, could uniquely determine the distribution of $\nu$. TriangleWad has no such interpolating measure between $\mu$ and $\nu$. In addition, $\mathcal{W}(\nu,\gamma),\mathcal{W}(\nu,\eta_\mu),\mathcal{W}(\eta_\nu,\gamma)$ are private information. $IM(a,b)$ denotes the interpolating measure between $a$ and $b$.
  • Figure 3: Noisy feature detection on CIFAR10 and the tabular dataset Adult. Our approach detects noisy data better than other data valuation approaches. It is worth noting that the other approaches need access to raw data, while TriangleWad can be used in the private setting.
  • Figure 4: Our approach has low test error (MSE) on both synthetic data and real-world medical imaging data.
  • Figure 5: Line plots: The lines for the predicted Wasserstein distance (black) and the actual Wasserstein distance (green) between interpolating measures overlap. When $t=s_0$, $\hat{\mathcal{W}}(\mu,\nu)$ has the minimal gap to $\mathcal{W}_2(\mu,\nu)$; Dot plots: Predicted distance vs. actual distance between two interpolating measures. Orange dots are used for fitting and black dots are predictions.
  • ...and 7 more figures

Theorems & Definitions (10)

  • Definition 2.1
  • Definition 2.2
  • Theorem 3.1
  • Corollary 3.2
  • Corollary 3.3
  • Remark 3.4
  • Theorem 3.5
  • Theorem 4.1
  • Theorem 2.1
  • proof