Improving the Variance of Differentially Private Randomized Experiments through Clustering

Adel Javanmard, Vahab Mirrokni, Jean Pouget-Abadie

TL;DR

This work tackles the problem of estimating causal effects under differential privacy by exploiting non-private clustering structure to reduce the variance penalty induced by DP noise. It introduces Cluster-DP, a cluster-aware DP mechanism, and its Cluster-Free variant, along with an efficient unbiased estimator $\hat{\tau}_Q$ that debiases privatized outcomes using the inverse of a cluster-specific routing matrix $Q_{c,a}$. Theoretical results establish precise DP guarantees and variance bounds that depend on cluster quality via cluster-homogeneity terms $\phi_a$, demonstrating improved privacy-utility trade-offs when clusters are homogeneous. Empirically, Cluster-DP consistently achieves lower variance than baselines (Uniform-Prior-DP and Cluster-Free-DP) across synthetic models and a real YouTube network dataset, highlighting practical impact for privacy-preserving causal analysis in ads and other clustered settings.
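
The idea of debiasing privatized outcomes with a matrix inverse can be illustrated with a generic randomized-response channel. The sketch below is not the paper's Cluster-DP mechanism or its estimator $\hat{\tau}_Q$: the channel (keep the true label with probability `keep_prob`, otherwise draw a label from a fixed `prior`), and every function and parameter name, are assumptions made for illustration only.

```python
def channel_matrix(prior, keep_prob):
    """Q[i][j] = P(reported label j | true label i) for the assumed channel."""
    k = len(prior)
    return [
        [keep_prob * (1.0 if i == j else 0.0) + (1.0 - keep_prob) * prior[j]
         for j in range(k)]
        for i in range(k)
    ]


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("channel matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def privatized_distribution(true_dist, q):
    """Distribution of reported labels: observed_j = sum_i true_i * Q[i][j]."""
    k = len(true_dist)
    return [sum(true_dist[i] * q[i][j] for i in range(k)) for j in range(k)]


def debias(observed, q):
    """Recover the true label distribution by solving Q^T x = observed."""
    k = len(observed)
    q_transposed = [[q[i][j] for i in range(k)] for j in range(k)]
    return solve(q_transposed, observed)


# Hypothetical numbers: three outcome levels, labels kept 60% of the time.
Q = channel_matrix(prior=[0.5, 0.3, 0.2], keep_prob=0.6)
observed = privatized_distribution([0.1, 0.2, 0.7], Q)
recovered = debias(observed, Q)
```

With exact reported frequencies the inverse recovers the true distribution exactly; with finite samples the same inversion stays unbiased but amplifies sampling noise, which is the variance penalty the summary refers to.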

Abstract

Estimating causal effects from randomized experiments is only possible if participants are willing to disclose their potentially sensitive responses. Differential privacy, a widely used framework for ensuring an algorithm's privacy guarantees, can encourage participants to share their responses without the risk of de-anonymization. However, many mechanisms achieve differential privacy by adding noise to the original dataset, which reduces the precision of causal effect estimation. This introduces a fundamental trade-off between privacy and variance when performing causal analyses on differentially private data. In this work, we propose a new differentially private mechanism, "Cluster-DP", which leverages a given cluster structure in the data to improve the privacy-variance trade-off. While our results apply to any clustering, we demonstrate that selecting higher-quality clusters, according to a quality metric we introduce, can decrease the variance penalty without compromising privacy guarantees. Finally, we evaluate the theoretical and empirical performance of our Cluster-DP algorithm on both real and simulated data, comparing it to common baselines, including two special cases of our algorithm: its unclustered version and a uniform-prior version.

Paper Structure (37 sections, 14 theorems, 111 equations, 8 figures, 2 tables, 4 algorithms)

Key Result

Theorem 3.1

Let $\tilde{\varepsilon}>0$ and $\delta := \max\left(0,\, 1-\lambda+\lambda \gamma (1-e^{\tilde{\varepsilon}})\right)$. The Cluster-DP mechanism described in Algorithm 1 is $(\varepsilon,\delta)$-label DP with $\varepsilon = \min\left(\frac{1}{\sigma},\frac{2}{\gamma}\right) +\tilde{\varepsilon}$.
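
As a worked example, the two expressions quoted in the theorem can be evaluated directly. The sketch below only transcribes those formulas as they appear on this page; the function name, the argument names, and the sample parameter values are assumptions for illustration and are not taken from the paper's experiments.

```python
import math


def cluster_dp_privacy(sigma, gamma, lam, eps_tilde):
    """Evaluate (epsilon, delta) from the expressions quoted in Theorem 3.1.

    epsilon = min(1/sigma, 2/gamma) + eps_tilde
    delta   = max(0, 1 - lam + lam * gamma * (1 - exp(eps_tilde)))
    """
    if sigma <= 0 or gamma <= 0 or eps_tilde <= 0:
        raise ValueError("sigma, gamma and eps_tilde must be positive")
    # sigma = math.inf is allowed: 1 / inf evaluates to 0.0 in Python floats.
    epsilon = min(1.0 / sigma, 2.0 / gamma) + eps_tilde
    delta = max(0.0, 1.0 - lam + lam * gamma * (1.0 - math.exp(eps_tilde)))
    return epsilon, delta


# Hypothetical parameter choices, for illustration only.
eps_a, delta_a = cluster_dp_privacy(sigma=10.0, gamma=0.1, lam=1.0, eps_tilde=0.1)
eps_b, delta_b = cluster_dp_privacy(sigma=math.inf, gamma=0.1, lam=0.9, eps_tilde=0.1)
```

In the first call the $\min$ term equals $1/\sigma = 0.1$, so $\varepsilon = 0.2$, and the expression inside the $\max$ is negative, so $\delta = 0$. In the second call $1/\sigma = 0$, so $\varepsilon = \tilde{\varepsilon}$, while $\lambda < 1$ makes $\delta$ strictly positive.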

Figures (8)

  • Figure 1: Illustration of Cluster-DP mechanism with a central unit computing the (clustered) privatized outcomes for valid causal inference.
  • Figure 2: Fuller figures can be found in Appendix \ref{sec:supplement:fuller-figures}. (1.a) Variances of each mechanism as we vary the truncation level $\gamma \in [0.1/K,1/K]$ in Experiment 1. Privacy loss fixed at ${\varepsilon} =0.2$ and $\delta = 10^{-4}$. (1.b) Privacy-variance trade-off of each mechanism under the setting of Experiment 1. We fix the DP failure probability to $\delta = 10^{-4}$, and optimize the choice of $\sigma$ and $\gamma$ in the sets $\sigma\in\{10,20,\infty\}$ and $\gamma\in\{0.01/K,0.1/K, 1/K\}$. (1.c) Ratio of the variance of the estimators under the Cluster-DP and Cluster-Free-DP mechanisms in Experiment 2. The benefit of the Cluster-DP mechanism is stronger at larger $\beta$ and smaller values of $\lambda$. (1.d) Privacy-variance trade-off of the Cluster-DP and Cluster-Free-DP stratified estimators for the YouTube dataset in Experiment 3. The dotted line is the variance of the non-private stratified estimator.
  • Figure 3: Histogram of $\hat{\tau}-\tau$
  • Figure 4: Q-Q plot of $\hat{\tau} - \tau$
  • Figure 5: The variance gap between the private estimator $\hat{\tau}_Q$, given in Theorem \ref{thm:unbiased_consistent_estimator}, and the non-private estimator $\hat{\tau}_{\textsc{No-DP}}$ in the setting of Experiment 5. The upper boundary of the shaded area corresponds to the upper bound derived in Theorem \ref{thm:variance_upper_bound}, and its lower boundary corresponds to the first term in that bound. As we see, the gap remains between the two boundaries.
  • ...and 3 more figures

Theorems & Definitions (21)

  • Remark 2.1
  • Definition 2.2
  • Theorem 3.1
  • Theorem 3.2
  • Remark 3.3
  • Definition 3.4: Cluster homogeneity
  • Theorem 3.5
  • Definition 1.1
  • Proposition 1.2
  • Proposition 1.3
  • ...and 11 more