Differentially Private Distribution Estimation Using Functional Approximation

Ye Tao, Anand D. Sarwate

TL;DR

This work addresses private CDF estimation by projecting the empirical CDF onto a finite-dimensional space of Legendre polynomials and privatizing the projection coefficients via the functional mechanism, achieving $(\epsilon,\delta)$-DP. The core contribution, Polynomial Projection (PP), balances accuracy against privacy with explicit $L_2$ error bounds and is well suited to decentralized and incremental-data scenarios. Empirical results show that PP is competitive with adaptive quantiles (AQ) in the moderate-privacy regime and outperforms histogram queries (HQ), while offering practical advantages for distributed settings and privacy-preserving visualizations. The approach opens avenues for alternative function spaces and high-dimensional extensions, along with a deeper examination of post-processing utilities.

Abstract

The cumulative distribution function (CDF) is fundamental because it fully characterizes a random variable, so studies that work with sensitive data need privacy-preserving ways to estimate it. This paper introduces a novel privacy-preserving CDF method inspired by functional analysis and the functional mechanism. Our approach projects the empirical CDF onto a predefined function space, approximates it with a fixed set of basis functions, and protects the resulting coefficients to obtain a differentially private empirical CDF. Compared to existing methods such as histogram queries and adaptive quantiles, our method is preferable in decentralized settings and in scenarios where CDFs must be updated with newly collected data.
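
The projection-and-perturbation idea described above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes data already scaled to $[-1,1]$, uses the closed-form Legendre coefficients of a step function, and calibrates Gaussian noise with the classical formula $\sigma = \Delta\sqrt{2\ln(1.25/\delta)}/\epsilon$ (valid for $\epsilon < 1$). The sensitivity bound $\Delta = 2\sqrt{K+1}/N$ is a loose replace-one bound assumed here for illustration, the function names `legendre_coeffs` and `private_cdf` are hypothetical, and the clip-and-running-maximum post-processing is one simple choice among many.

```python
import numpy as np
from numpy.polynomial import legendre as L


def legendre_coeffs(x, K):
    """Legendre coefficients (degrees 0..K) of the empirical CDF of x in [-1, 1].

    Each data point contributes a step function 1[t >= x_i]; its coefficients are
    c_0 = (1 - x_i) / 2 and c_k = (P_{k-1}(x_i) - P_{k+1}(x_i)) / 2 for k >= 1.
    The empirical CDF coefficients are the averages over all points.
    """
    x = np.asarray(x, dtype=float)
    V = L.legvander(x, K + 1)                  # columns P_0 .. P_{K+1}
    per_point = np.empty((x.size, K + 1))
    per_point[:, 0] = (1.0 - x) / 2.0
    per_point[:, 1:] = (V[:, :K] - V[:, 2:K + 2]) / 2.0
    return per_point.mean(axis=0)


def private_cdf(x, K, epsilon, delta, grid, rng=None):
    """Evaluate a noisy polynomial CDF estimate on `grid` (illustrative sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    coeffs = legendre_coeffs(x, K)
    # ASSUMPTION: loose L2 sensitivity under replace-one neighbors. Every
    # per-point entry lies in [-1, 1], so swapping one record moves the mean
    # vector by at most 2 * sqrt(K + 1) / N. The paper's bound may be tighter.
    sensitivity = 2.0 * np.sqrt(K + 1) / len(x)
    # Classical Gaussian mechanism calibration (assumes 0 < epsilon < 1).
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    noisy = coeffs + rng.normal(0.0, sigma, size=K + 1)
    estimate = L.legval(grid, noisy)
    # Post-processing (does not affect the DP guarantee): clip to [0, 1] and
    # enforce monotonicity with a running maximum.
    return np.maximum.accumulate(np.clip(estimate, 0.0, 1.0))
```

Because the coefficients are sample averages, coefficient vectors computed on disjoint batches can be combined by a sample-size-weighted average, which is what makes this representation convenient for the decentralized and incremental-update settings mentioned above.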

Paper Structure

This paper contains 11 sections, 3 theorems, 15 equations, and 8 figures.

Key Result

Theorem 1

Let $F^*$ be the true CDF for a random variable with $x \in [-1,1]$. If $\check{F}$ is the optimal approximation of $F^*$ in the polynomial space $\mathcal{P}$ and $\|F^*-\check{F}\|_2 \leq \alpha$, then with probability at least $1-2 \exp(-\frac{N(\eta-\alpha)^2}{16}) - 2(K+1) \exp (-\frac{(\eta-\a

Figures (8)

  • Figure 1: Apply different methods with the Gaussian mechanism to normal distribution using the following parameters: $N=10^4, \epsilon=0.1, \delta=N^{-3/2}, K=6$. The bin number for HQ is set to $30$, and the number of iterations for AQ is $50$.
  • Figure 2: Comparison of distances between different DP CDF methods and the true CDF using various measurement methods: (a) Kolmogorov-Smirnov Distance; (b) Earth Mover's Distance; and (c) Energy Distance. The experiment was run $50$ times with $N=10^4, \delta=N^{-3/2}, K=6$. The bin number in HQ was set to $30$, and the number of iterations in AQ was $50$.
  • Figure 3: The experiment was run $50$ times on a heights-and-weights dataset (https://www.kaggle.com/datasets/burnoutminer/heights-and-weights-dataset) with $N=25000$, $\delta=N^{-3/2}$, and $K=6$. The bin number in HQ was set to $30$, and the number of iterations in AQ was $50$. Distances were computed between the final DP CDF and the true CDF. (a) - (c) Decentralized setting with $10$ sites; (d) - (f) Newly collected data setting, where the CDF was updated after every $2500$ new data points, for a total of $10$ rounds of updates.
  • Figure 4: The experiment was conducted with $N=10^4$, $\epsilon=1$ using data from a standard normal distribution.
  • Figure 4.1: Apply different methods with the Gaussian mechanism to various distributions using the following parameters: $N=10^4, \epsilon=0.1, \delta=N^{-3/2}, K=6$. The bin number for HQ is set to $30$, and the number of iterations for AQ is $50$.
  • ...and 3 more figures
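
The three distances named in Figure 2 can be computed directly from two CDFs sampled on a common grid. The sketch below uses the standard one-dimensional definitions (Kolmogorov-Smirnov as the largest absolute gap, Earth Mover's Distance as $\int |F-G|\,dx$, and energy distance as $\sqrt{2\int (F-G)^2\,dx}$); whether the paper uses exactly these normalizations is an assumption, and the helper name `cdf_distances` is hypothetical.

```python
import numpy as np


def _trapezoid(y, grid):
    """Trapezoidal rule on a possibly non-uniform grid."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(grid)) / 2.0)


def cdf_distances(F, G, grid):
    """Distances between two CDFs F and G evaluated on the same sorted grid."""
    F, G, grid = (np.asarray(a, dtype=float) for a in (F, G, grid))
    diff = F - G
    return {
        # Largest absolute gap, taken over grid points only.
        "ks": float(np.max(np.abs(diff))),
        # 1-D Earth Mover's (Wasserstein-1) distance: integral of |F - G|.
        "emd": _trapezoid(np.abs(diff), grid),
        # 1-D energy distance: sqrt(2 * integral of (F - G)^2).
        "energy": float(np.sqrt(2.0 * _trapezoid(diff ** 2, grid))),
    }
```

A finer grid tightens both the supremum and the integrals; for step-like CDFs the Kolmogorov-Smirnov value in particular can be underestimated on a coarse grid.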

Theorems & Definitions (7)

  • Definition 1: $(\epsilon, \delta)$-DP [dwork2014algorithmic]
  • Theorem 1: Upper Bound for $\|F^*-\tilde{F}\|_2$
  • proof
  • Remark 1
  • Theorem 2: The Classical Projection Theorem [luenberger1997optimization]
  • Theorem 3: Optimal Approximation of eCDF
  • proof
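
For reference, Definition 1 in its commonly stated form is given below. This is the standard textbook statement, not a quotation from the paper, and the neighboring relation the paper adopts (add/remove versus replace-one) is not specified in this summary. A randomized mechanism $\mathcal{M}$ satisfies $(\epsilon, \delta)$-DP if, for all neighboring datasets $D$ and $D'$ and all measurable sets $S$ of outputs,

$$\Pr[\mathcal{M}(D) \in S] \leq e^{\epsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta.$$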