Are Data Experts Buying into Differentially Private Synthetic Data? Gathering Community Perspectives
Lucas Rosenblatt, Bill Howe, Julia Stoyanovich
TL;DR
This qualitative study investigates data experts' views on differentially private synthetic data (DP SD) through 17 semi-structured interviews across academia and industry. It finds that DP SD is rarely adopted as a first resort, with enthusiasm tempered by epistemic, trust, and validation concerns, and by a lack of partner-vetted use cases. The authors propose three concrete recommendations: require evidence of validation in at least one partner-vetted use case; publish organization-wide standards of evidence for DP data usage; and implement a tiered access model to sensitive data (a driver’s license approach) to balance exploration with privacy risk. The work highlights the need for context-aware, human-centered DP tools and governance, including clear discussion of the DP privacy parameter $\epsilon$ and practical utility considerations, to bridge theory and practice in data sharing.
Abstract
Data privacy is a core tenet of responsible computing, and in the United States, differential privacy (DP) is the dominant technical operationalization of privacy-preserving data analysis. With this study, we qualitatively examine one class of DP mechanisms: private data synthesizers. To that end, we conducted semi-structured interviews with data experts: academics and practitioners who regularly work with data. Broadly, our findings suggest that quantitative DP benchmarks must be grounded in practitioner needs, while communication challenges persist. Participants expressed a need for context-aware DP solutions, focusing on parity between research outcomes on real and synthetic data. Our analysis led to three recommendations: (1) improve existing insufficient sanitized benchmarks; successful DP implementations require well-documented, partner-vetted use cases, (2) organizations using DP synthetic data should publish discipline-specific standards of evidence, and (3) tiered data access models could allow researchers to gradually access sensitive data based on demonstrated competence with high-privacy, low-fidelity synthetic data.
