Assessing Utility of Differential Privacy for RCTs
Kaitlyn R. Webb, Soumya Mukherjee, Aratrika Mustafi, Aleksandra Slavković, Lars Vilhuber
TL;DR
This work tackles the challenge of sharing RCT data while protecting respondent privacy by evaluating three differential-privacy–inspired mechanisms built around a perturbed multivariate histogram. It develops a model-agnostic MV Histogram method and two model-informed approaches (Hybrid and GenModel), assessing their ability to preserve inference on treatment effects under privacy budgets. Through simulations and a real-world Liberia Reducecrime application, the Hybrid method consistently yields strong CI overlap with the confidential results, while MV Histogram offers robust utility with many covariates; GenModel can suffer from unstable standard errors or outliers, especially with limited budgets or high model complexity. The findings support privacy-preserving replication packages as feasible tools for sharing RCT data in privacy-sensitive settings, with practical guidance on method choice given covariate dimensionality and model fidelity.
Abstract
Randomized controlled trials (RCTs) have become powerful tools for assessing the impact of interventions and policies in many contexts. They are considered the gold standard for causal inference in the biomedical fields and many social sciences. Researchers have published an increasing number of studies that rely on RCTs for at least part of their inference. These studies typically include the response data that have been collected, de-identified, and sometimes protected through traditional disclosure limitation methods. In this paper, we empirically assess the impact of privacy-preserving synthetic data generation methodologies on published RCT analyses by leveraging available replication packages (research compendia) in economics and policy analysis. We implement three privacy-preserving algorithms that build on one of the basic differentially private (DP) algorithms, the perturbed histogram, to support the quality of statistical inference. We highlight challenges with the direct use of this algorithm and of the stability-based histogram in our setting, and describe the adjustments needed. We provide simulation studies and demonstrate that we can replicate the analysis in a published economics article on privacy-protected data under various parameterizations. We find that privacy-preserving methods influenced by DP techniques, which are relatively straightforward at a high level, allow for inference-valid protection of published data. The results are applicable to researchers wishing to share RCT data with strong privacy protection, especially in the context of low- and middle-income countries.
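To make the base algorithm concrete, the following is a minimal sketch of a generic perturbed histogram with Laplace noise, followed by sampling of synthetic records. It is not the authors' implementation and omits the adjustments the paper describes. The function names `perturbed_histogram` and `sample_synthetic`, the assumption that all covariates are already coded as integers `0..k-1`, the replace-one-record neighboring relation (L1 sensitivity 2), and the clip-and-normalize post-processing are all illustrative assumptions.

```python
import numpy as np


def perturbed_histogram(data, levels, epsilon, rng):
    """Return noisy cell probabilities over the full cross-classification.

    data    : (n, d) integer array, column j takes values in 0..levels[j]-1
    levels  : number of categories per column (assumed known and public)
    epsilon : privacy budget spent on the histogram
    """
    shape = tuple(levels)
    counts = np.zeros(shape, dtype=float)
    # Tabulate every record into its cell; empty cells stay at zero.
    np.add.at(counts, tuple(data.T), 1)

    # Assumption: neighbors differ by replacing one record, so two cells
    # change by 1 each and the L1 sensitivity is 2.
    sensitivity = 2.0
    # Noise is added to ALL cells, including empty ones.
    noisy = counts + rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=shape)

    # Post-processing (no extra privacy cost): clip negatives, normalize.
    noisy = np.clip(noisy, 0.0, None)
    total = noisy.sum()
    if total == 0.0:
        # Degenerate case: fall back to a uniform distribution.
        return np.full(shape, 1.0 / noisy.size)
    return noisy / total


def sample_synthetic(probs, n, rng):
    """Draw n synthetic records from the noisy cell probabilities."""
    flat = rng.choice(probs.size, size=n, p=probs.ravel())
    return np.column_stack(np.unravel_index(flat, probs.shape))


# Hypothetical toy data: three categorical covariates with 2, 3, and 2 levels.
rng = np.random.default_rng(0)
levels = [2, 3, 2]
toy = np.column_stack([rng.integers(0, k, size=200) for k in levels])
probs = perturbed_histogram(toy, levels, epsilon=1.0, rng=rng)
synthetic = sample_synthetic(probs, n=200, rng=rng)
```

The number of cells grows multiplicatively with the number of covariates and their levels, so for a fixed budget the noise increasingly dominates the many sparse cells.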
