The Complexities of Differential Privacy for Survey Data
Jörg Drechsler, James Bailie
TL;DR
The paper analyzes the application of differential privacy (DP) to survey data and identifies five aspects that complicate adoption: multistage data production, limited privacy amplification from complex sampling designs, survey-weighted estimates, weighting adjustments for nonresponse and other data deficiencies, and imputation of missing values. It surveys how DP can be integrated into the survey pipeline, notes that privacy amplification under complex designs is limited and depends on the design, and analyzes the sensitivity of weighted estimators under DP, offering DP-friendly strategies and two DP-compatible imputation approaches. It reviews results on privacy amplification from sampling and the impact of weighting on Horvitz-Thompson estimators, and proposes pragmatic modifications to DP that balance privacy, utility, and implementability in agency settings. It concludes that practical deployment requires careful specification of invariants and pipeline steps, and it calls for principled risk-utility analyses and further research to enable DP-protected, high-utility public data products.
Abstract
The concept of differential privacy (DP) has gained substantial attention in recent years, most notably since the U.S. Census Bureau announced the adoption of the concept for its 2020 Decennial Census. However, despite its attractive theoretical properties, implementing DP in practice remains challenging, especially when it comes to survey data. In this paper, we present some results from an ongoing project funded by the U.S. Census Bureau that is exploring the possibilities and limitations of DP for survey data. Specifically, we identify five aspects that need to be considered when adopting DP in the survey context: the multi-staged nature of data production; the limited privacy amplification from complex sampling designs; the implications of survey-weighted estimates; the weighting adjustments for nonresponse and other data deficiencies; and the imputation of missing values. We summarize the project's key findings with respect to each of these aspects and also discuss some of the challenges that still need to be addressed before DP could become the new data protection standard at statistical agencies.
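
To make two of these aspects concrete, the following is a minimal sketch and is not taken from the paper. It assumes the textbook amplification bound for simple Poisson sampling, under which an ε-DP mechanism applied to a sample drawn at rate q satisfies log(1 + q(e^ε − 1))-DP, and a Laplace mechanism applied to a Horvitz-Thompson total. All function names, the clipping bound `upper_bound`, and the public weight cap `max_weight` are hypothetical choices made for illustration. The bound does not carry over to complex designs in general, and the sensitivity calculation treats the weights as fixed and capped by a public constant, which real weighting adjustments would violate.

```python
import math
import random


def amplified_epsilon(epsilon, sampling_rate):
    """Privacy parameter after simple Poisson sampling at rate `sampling_rate`.

    Textbook bound: log(1 + q * (exp(eps) - 1)). Assumed here for
    illustration only; complex survey designs need not satisfy it.
    """
    return math.log(1.0 + sampling_rate * (math.exp(epsilon) - 1.0))


def ht_sensitivity(max_weight, upper_bound):
    """Sensitivity bound for a weighted total with values in [0, upper_bound].

    Assumes weights are fixed (not data-dependent) and capped at the
    public constant `max_weight`; one record then moves the total by at
    most max_weight * upper_bound.
    """
    return max_weight * upper_bound


def laplace_noise(scale, rng):
    """Laplace(0, scale) draw as the difference of two exponentials.

    Illustrative only: floating-point samplers like this one are not
    safe for production DP releases.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_ht_total(values, weights, upper_bound, max_weight, epsilon, rng):
    """Noisy Horvitz-Thompson total: clip values and weights, add noise."""
    clipped_values = [min(max(v, 0.0), upper_bound) for v in values]
    clipped_weights = [min(max(w, 0.0), max_weight) for w in weights]
    total = sum(w * v for w, v in zip(clipped_weights, clipped_values))
    scale = ht_sensitivity(max_weight, upper_bound) / epsilon
    return total + laplace_noise(scale, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical sample: three respondents with design weights 1 / pi_i.
    values = [40.0, 75.0, 10.0]
    weights = [10.0, 20.0, 5.0]
    print(amplified_epsilon(1.0, 0.1))
    print(dp_ht_total(values, weights, 100.0, 20.0, 1.0, rng))
```

The sketch shows why weighting matters: the noise scale grows with the largest weight, so a single heavily weighted respondent can dominate the sensitivity of the estimate.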
