Assessing Utility of Differential Privacy for RCTs

Kaitlyn R. Webb, Soumya Mukherjee, Aratrika Mustafi, Aleksandra Slavković, Lars Vilhuber

TL;DR

This work tackles the challenge of sharing RCT data while protecting respondent privacy by evaluating three differential-privacy–inspired mechanisms built around a perturbed multivariate histogram. It develops a model-agnostic MV Histogram method and two model-informed approaches (Hybrid and GenModel), assessing their ability to preserve inference on treatment effects under privacy budgets. Through simulations and a real-world Liberia Reducecrime application, the Hybrid method consistently yields strong CI overlap with the confidential results, while MV Histogram offers robust utility with many covariates; GenModel can suffer from unstable standard errors or outliers, especially with limited budgets or high model complexity. The findings support privacy-preserving replication packages as feasible tools for sharing RCT data in privacy-sensitive settings, with practical guidance on method choice given covariate dimensionality and model fidelity.

Abstract

Randomized controlled trials (RCTs) have become powerful tools for assessing the impact of interventions and policies in many contexts. They are considered the gold standard for causal inference in the biomedical fields and many social sciences. Researchers have published an increasing number of studies that rely on RCTs for at least part of their inference. These studies typically include the response data that has been collected, de-identified, and sometimes protected through traditional disclosure limitation methods. In this paper, we empirically assess the impact of privacy-preserving synthetic data generation methodologies on published RCT analyses by leveraging available replication packages (research compendia) in economics and policy analysis. We implement three privacy-preserving algorithms that build on one of the basic differentially private (DP) algorithms, the perturbed histogram, to support the quality of statistical inference. We highlight challenges with the direct use of this algorithm and of the stability-based histogram in our setting and describe the adjustments needed. We provide simulation studies and demonstrate that we can replicate the analysis in a published economics article on privacy-protected data under various parameterizations. We find that relatively straightforward (at a high level) privacy-preserving methods influenced by DP techniques allow for inference-valid protection of published data. The results are applicable to researchers who wish to share RCT data with strong privacy protection, especially in the context of low- and middle-income countries.
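To make the base algorithm concrete, the following is a minimal sketch of a Laplace-perturbed multivariate histogram over categorical variables, followed by sampling synthetic records from it. It is an illustration under stated assumptions, not the authors' implementation: the function names, the Laplace scale of 2/epsilon (which assumes neighboring datasets differ by replacing one record), and the clip-and-normalize post-processing are all choices made here for the example.

```python
import itertools
import random
from collections import Counter


def laplace(rng, scale):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def perturbed_histogram(records, levels, epsilon, rng):
    """Return (cells, probs) for a noisy histogram over all level combinations.

    records: list of tuples, one categorical code per variable.
    levels:  number of levels of each variable.
    """
    # Enumerate every cell, including empty ones, so that noise is added
    # regardless of which cells are occupied in the confidential data.
    cells = list(itertools.product(*(range(k) for k in levels)))
    counts = Counter(records)
    # Assumption: replacing one record changes two counts by 1 each,
    # so the L1 sensitivity is 2 and the Laplace scale is 2 / epsilon.
    scale = 2.0 / epsilon
    # Post-processing: clip negative noisy counts at zero.
    noisy = [max(counts[c] + laplace(rng, scale), 0.0) for c in cells]
    total = sum(noisy)
    if total == 0.0:
        probs = [1.0 / len(cells)] * len(cells)
    else:
        probs = [v / total for v in noisy]
    return cells, probs


def sample_synthetic(cells, probs, n, rng):
    # Draw n synthetic records from the noisy cell probabilities.
    return rng.choices(cells, weights=probs, k=n)


rng = random.Random(0)
levels = (2, 3, 2)  # hypothetical: treatment, a 3-level and a 2-level covariate
records = [tuple(rng.randrange(k) for k in levels) for _ in range(500)]
cells, probs = perturbed_histogram(records, levels, epsilon=1.0, rng=rng)
synthetic = sample_synthetic(cells, probs, n=500, rng=rng)
```

The number of cells grows multiplicatively with the number of variables and their levels, so with many covariates most cells are empty and the added noise can dominate the true counts. This is one way to see why direct use of the perturbed histogram may need adjustment in an RCT setting.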

Paper Structure (27 sections, 10 equations, 19 figures, 19 tables, 5 algorithms)

Figures (19)

  • Figure 1: Both model-informed methods of generating synthetic data follow the same five overall steps.
  • Figure 2: GenModel from Algorithm \ref{algo:fullmodelbased} is visualized above. Yellow represents parameters and model inputs. Red indicates that the output of a step contains confidential information. Blue indicates that the output satisfies approximate DP if the multivariate histogram satisfies approximate DP.
  • Figure 3: The GenModel and MV Histogram sanitized treatment effects take a wider range of values across the 20 repetitions, while the Hybrid sanitized treatment effects are more concentrated near the confidential estimated treatment effect (red dashed line). The outliers have been removed here, but plots including outliers can be found in Figure \ref{fig:apdx-sim-privbudget} in Appendix \ref{sec:appendix-plots}.
  • Figure 4: The distribution of CI overlap values for Models 1 to 9 based on the number and type of covariates for various sanitizing methods.
  • Figure 5: Estimated treatment effects by sanitization method with the confidential estimator (black triangle) marked for each model. The MV Histogram and Hybrid results use $\epsilon=1$. The GenM-2 method uses $\epsilon=2$, which sets $\epsilon_{\mathbf{X}}=1$, matching the budget used by the other methods.
  • ...and 14 more figures
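
The figures above compare methods by CI overlap, which this excerpt does not define. As a hedged illustration, the sketch below uses a definition common in the synthetic-data literature: the length of the intersection of the confidential and sanitized confidence intervals, averaged as a fraction of each interval's own length. The function name and the truncation at zero for disjoint intervals are assumptions made for this example and may differ from the paper's exact measure.

```python
def ci_overlap(confidential, sanitized):
    """Average fractional overlap of two confidence intervals.

    Each argument is a (lower, upper) pair with lower < upper.
    Returns a value in [0, 1]: 1 for identical intervals and
    0 for disjoint ones (truncation at 0 is an assumption).
    """
    lo_c, hi_c = confidential
    lo_s, hi_s = sanitized
    # Endpoints of the intersection of the two intervals.
    lo = max(lo_c, lo_s)
    hi = min(hi_c, hi_s)
    if hi <= lo:
        return 0.0
    width = hi - lo
    # Average the overlap as a share of each interval's own length.
    return 0.5 * (width / (hi_c - lo_c) + width / (hi_s - lo_s))


# Hypothetical intervals for a treatment effect estimate.
overlap = ci_overlap((0.0, 2.0), (1.0, 3.0))
```

Under this definition a much wider sanitized interval that fully contains the confidential one still scores below 1, because the shared region is only a small share of the wider interval. Inflated standard errors are therefore penalized as well as shifted point estimates.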