Private Wasserstein Distance
Wenqian Li, Yan Pang
TL;DR
The paper addresses the problem of estimating the $2$-Wasserstein distance $\mathcal{W}_2(\mu,\nu)$ between privacy-sensitive datasets held by different parties, without sharing raw data. It introduces TriangleWad, a privacy-preserving framework that uses Wasserstein geodesics with a private Gaussian reference distribution $\gamma$ to bound $\mathcal{W}_2(\mu,\nu)$ in one round of interaction, avoiding direct interpolation between the raw distributions and hidden transport plans. The theoretical results bound the approximation error in terms of the reference distribution $\gamma$ and the push-forward parameter $t$, and are accompanied by complexity and attack-defense analyses. Empirically, TriangleWad achieves competitive accuracy at reduced computational cost on image and text tasks. It also extends to distances between labeled datasets, data valuation, noisy-data detection, and multi-source distance computation, with demonstrated utility in federated learning (FL) and data marketplaces. Overall, the work offers a practical, privacy-conscious mechanism for assessing distributional similarity in real-world privacy-sensitive settings.
Abstract
Wasserstein distance is a key metric for quantifying data divergence from a distributional perspective. However, its application in privacy-sensitive environments, where direct sharing of raw data is prohibited, presents significant challenges. Existing approaches, such as Differential Privacy and Federated Optimization, have been employed to estimate the Wasserstein distance under such constraints, but they often fall short when both accuracy and security are required. In this study, we explore the inherent triangular properties of the Wasserstein space, leading to a novel solution named TriangleWad. This approach enables fast computation of the Wasserstein distance between datasets stored across different entities while keeping the raw data completely hidden. TriangleWad not only strengthens resistance to potential attacks but also preserves high estimation accuracy. Through extensive experiments across various tasks involving both image and text data, we demonstrate its superior performance and significant potential for real-world applications.
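To make the "triangular properties" concrete, the following is a minimal sketch of the plain triangle inequality that any metric, including $\mathcal{W}_2$, satisfies: $|\mathcal{W}_2(\mu,\gamma) - \mathcal{W}_2(\gamma,\nu)| \le \mathcal{W}_2(\mu,\nu) \le \mathcal{W}_2(\mu,\gamma) + \mathcal{W}_2(\gamma,\nu)$. It is not the TriangleWad protocol itself, which relies on Wasserstein geodesics and a push-forward parameter $t$. The function names `w2_1d` and `triangle_bounds`, the one-dimensional setting, and the sample parameters are illustrative assumptions, not taken from the paper.

```python
import math
import random


def w2_1d(x, y):
    """Empirical 2-Wasserstein distance between two equal-size 1D samples.

    In one dimension with equal sample sizes, the optimal coupling matches
    sorted values, so W2 is the root-mean-square gap between order statistics.
    """
    if len(x) != len(y):
        raise ValueError("this sketch assumes equal sample sizes")
    xs, ys = sorted(x), sorted(y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xs, ys)) / len(xs))


def triangle_bounds(w_mu_gamma, w_gamma_nu):
    """Lower and upper bounds on W2(mu, nu) implied by the triangle inequality."""
    return abs(w_mu_gamma - w_gamma_nu), w_mu_gamma + w_gamma_nu


# Hypothetical data: mu and nu stand in for two parties' private samples,
# gamma for a Gaussian reference sample (all parameters are illustrative).
rng = random.Random(0)
n = 500
mu = [rng.gauss(0.0, 1.0) for _ in range(n)]
nu = [rng.gauss(2.0, 1.5) for _ in range(n)]
gamma = [rng.gauss(0.5, 1.0) for _ in range(n)]

# Each party only needs its own distance to the reference to form the bounds.
lower, upper = triangle_bounds(w2_1d(mu, gamma), w2_1d(gamma, nu))

# Computed here only to check the bounds; it requires both raw samples.
direct = w2_1d(mu, nu)
print(f"{lower:.3f} <= {direct:.3f} <= {upper:.3f}")
```

These bounds can be loose when the reference is far from both datasets, which is one reason a scheme of this kind would need a more careful construction than the bare inequality shown here.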
