The Complexities of Differential Privacy for Survey Data

Jörg Drechsler, James Bailie

TL;DR

The paper analyzes the application of differential privacy (DP) to survey data and identifies five aspects that complicate adoption: multistage data production, limited privacy amplification from complex sampling designs, survey-weighted estimates, weighting adjustments for nonresponse and other data deficiencies, and imputation of missing values. It examines where DP can be integrated into the survey pipeline, reviews results on privacy amplification from sampling and notes that amplification under complex designs is limited and design-dependent, and analyzes the sensitivity of weighted estimators such as the Horvitz-Thompson estimator. It offers DP-friendly weighting strategies and two DP-compatible imputation approaches, and proposes pragmatic modifications to DP that balance privacy, utility, and implementability in agency settings. It concludes that practical deployment requires careful specification of invariants and pipeline steps, and calls for principled risk-utility analyses and further research toward DP-protected public data products with high utility.

Abstract

The concept of differential privacy (DP) has gained substantial attention in recent years, most notably since the U.S. Census Bureau announced the adoption of the concept for its 2020 Decennial Census. However, despite its attractive theoretical properties, implementing DP in practice remains challenging, especially when it comes to survey data. In this paper we present some results from an ongoing project funded by the U.S. Census Bureau that is exploring the possibilities and limitations of DP for survey data. Specifically, we identify five aspects that need to be considered when adopting DP in the survey context: the multi-staged nature of data production; the limited privacy amplification from complex sampling designs; the implications of survey-weighted estimates; the weighting adjustments for nonresponse and other data deficiencies, and the imputation of missing values. We summarize the project's key findings with respect to each of these aspects and also discuss some of the challenges that still need to be addressed before DP could become the new data protection standard at statistical agencies.

Paper Structure (9 sections, 2 figures)

This paper contains 9 sections, 2 figures.

Figures (2)

  • Figure 2.1: A survey pipeline consists of multiple steps, of which some of the most important are: determining the target population to be studied; constructing the frame; drawing the sample; collecting survey data from the responding units; processing the data (including coding free-form responses, editing inconsistent or improbable data, imputing missing records or variables, calculating the survey weights, and injecting privacy-protecting noise); and computing the survey outputs. There are of course additional steps to a survey pipeline after the survey outputs are released (such as data analysis) but, as they are not important to this paper's subject, we exclude these steps from discussion. While not shown in this figure, data from earlier stages of a pipeline are often used in later stages. (For example, the frame is usually used in computing the survey weights during the production of the processed data.)
  • Figure 2.2: Three examples of where to start the data-release mechanism (circled in red) in the survey pipeline and which of the previous stages to take as invariant (those stages before the pipeline branches). Recall from Figure 2.1 that $\mathfrak{p}$ denotes the population, $\mathfrak{f}$ the frame, $\mathfrak{s}$ the sample, $\mathfrak{r}$ the responding sample, $\mathfrak{d}$ the processed data and $t$ the survey outputs. The apostrophe $'$ indicates an alternative realisation of the associated variable. Panel (a) illustrates the standard approach in which there are no invariants and the data-release mechanism only executes the final step of the survey pipeline: transforming the processed data into the survey outputs. In panel (b), the mechanism begins with the frame and includes the sampling, responding and processing steps. The population is considered invariant. In panel (c), the mechanism takes as input the responding sample. Both the population and the frame are taken as invariant, so that DP only compares samples from the same frame. This reduces the sensitivity of weighted estimators at the expense of reduced privacy (see the section on survey weights).
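
To make the link between invariants and the sensitivity of weighted estimators more concrete, the following is a minimal Python sketch of a Laplace mechanism for a Horvitz-Thompson total. It is an illustration under stated assumptions, not the authors' method: neighbouring datasets are assumed to differ by replacing one respondent (weight and value together), values are assumed to lie in [0, value_bound], and weights in [0, weight_bound], where both bounds are hypothetical public constants (for instance, derived from an invariant frame and sampling design). Under these assumptions one replacement changes the weighted total by at most weight_bound * value_bound. The function and parameter names are made up for this example.

```python
import numpy as np


def ht_total(values, weights):
    """Horvitz-Thompson style estimate of a population total: sum of w_i * y_i."""
    return float(np.dot(weights, values))


def laplace_ht_total(values, weights, value_bound, weight_bound, epsilon, rng):
    """Release a weighted total with Laplace noise.

    Assumptions (hypothetical, for illustration only):
      - values are clipped to [0, value_bound];
      - weights are clipped to [0, weight_bound], a public bound that
        does not depend on the confidential responses;
      - neighbours differ by replacing one respondent's (weight, value) pair.
    Under these assumptions the total changes by at most
    weight_bound * value_bound, which is used as the sensitivity.
    """
    values = np.clip(np.asarray(values, dtype=float), 0.0, value_bound)
    weights = np.clip(np.asarray(weights, dtype=float), 0.0, weight_bound)
    sensitivity = weight_bound * value_bound
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return ht_total(values, weights) + noise


rng = np.random.default_rng(0)
y = [1.0, 2.0, 3.0]
w = [10.0, 10.0, 20.0]
print(ht_total(y, w))  # 90.0
print(laplace_ht_total(y, w, value_bound=5.0, weight_bound=40.0, epsilon=1.0, rng=rng))
```

The noise scale grows linearly with weight_bound, so a single unit with a large weight can dominate the noise added to the total. If the weights cannot be bounded by public information, for example because they depend on the frame or on other units' responses and those are not treated as invariant, the bound used here no longer applies and the sensitivity can be much larger.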