Are Data Experts Buying into Differentially Private Synthetic Data? Gathering Community Perspectives

Lucas Rosenblatt, Bill Howe, Julia Stoyanovich

TL;DR

This qualitative study investigates data experts' views on differentially private synthetic data (DP SD) through 17 semi-structured interviews across academia and industry. It finds that DP SD is rarely adopted as a first resort, with enthusiasm tempered by epistemic, trust, and validation concerns, and by a lack of partner-vetted use cases. The authors propose three concrete recommendations: require evidence of validation in at least one partner-vetted use case; publish organization-wide standards of evidence for DP data usage; and implement a tiered access model to sensitive data (a driver’s license approach) to balance exploration with privacy risk. The work highlights the need for context-aware, human-centered DP tools and governance, including clear discussion of the DP privacy parameter $\epsilon$ and practical utility considerations, to bridge theory and practice in data sharing.

Abstract

Data privacy is a core tenet of responsible computing, and in the United States, differential privacy (DP) is the dominant technical operationalization of privacy-preserving data analysis. With this study, we qualitatively examine one class of DP mechanisms: private data synthesizers. To that end, we conducted semi-structured interviews with data experts: academics and practitioners who regularly work with data. Broadly, our findings suggest that quantitative DP benchmarks must be grounded in practitioner needs, while communication challenges persist. Participants expressed a need for context-aware DP solutions, focusing on parity between research outcomes on real and synthetic data. Our analysis led to three recommendations: (1) improve existing insufficient sanitized benchmarks; successful DP implementations require well-documented, partner-vetted use cases, (2) organizations using DP synthetic data should publish discipline-specific standards of evidence, and (3) tiered data access models could allow researchers to gradually access sensitive data based on demonstrated competence with high-privacy, low-fidelity synthetic data.

Paper Structure

This paper contains 25 sections, 1 equation, 3 figures, 12 tables.

Figures (3)

  • Figure 1: Materials used in Steps 3–5 of the interview. In the first part of the slide prompts, participants were presented with sample data and asked about its sensitivity. In the second part, they were presented with privacy terms to define, and then were given a definition. In the third part, they were shown histogram comparisons and scatterplot correlations between the real and fake data.
  • Figure 2: Absolute count of participants discussing general topics, out of a total of 17 participants. Presented overall, as well as sorted by participants with above average privacy priors ("High PP", $n=11$) and below average privacy priors ("Low PP", $n=6$). Most participants expressed a need or desire for differentially private synthetic data and carefully thought about the legal ramifications; many also expressed skepticism and a desire for real-data validation. However, only some participants had concrete experience working with synthetic data and none of the participants with a low PP score mentioned the U.S. Census use case.
  • Figure 3: Full slides used for participant prompting during interviews, abbreviated in Figure 1.

Theorems & Definitions (2)

  • Definition 1: Data Expert
  • Definition 2: Differential Privacy (Dwork and Roth, 2014)
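
Definition 2 is listed above by title only. As a reference point, and assuming the paper follows the standard formulation from Dwork and Roth (2014) rather than a variant, a randomized mechanism $\mathcal{M}$ is $(\epsilon, \delta)$-differentially private if, for every pair of datasets $D$ and $D'$ that differ in a single record and every set of possible outputs $S$:

$$
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\epsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta
$$

The symbols $\mathcal{M}$, $D$, $D'$, $S$, and $\delta$ are notational assumptions made here and may not match the paper's own wording. Setting $\delta = 0$ gives pure $\epsilon$-DP. A smaller $\epsilon$ forces the two output distributions closer together, which means stronger privacy and typically lower fidelity in the synthetic data; this is the trade-off behind the "high-privacy, low-fidelity" tier in the abstract's third recommendation.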