"Having Confidence in My Confidence Intervals": How Data Users Engage with Privacy-Protected Wikipedia Data

Harold Triedman, Jayshree Sarathy, Priyanka Nanayakkara, Rachel Cummings, Gabriel Kaptchuk, Sean Kross, Elissa M. Redmiles

TL;DR

The paper examines how data users engage with privacy-noised Wikipedia pageview data produced by rounding versus differential privacy (DP). It combines expert-informed documentation with a task-based contextual inquiry of 15 data users to show how they interpret noise, compute uncertainty, and choose data depending on audience. Key findings show that rounding is easier to understand but less informative about uncertainty, while DP supports simulation-based uncertainty estimates but complicates cross-peak comparisons. Several participants also incorrectly took DP's stronger utility to imply weaker privacy protections. The work offers design recommendations for documentation and tooling to make privacy-noised datasets more usable, and highlights the need for research on audience-tailored communication.

Abstract

In response to calls for open data and growing privacy threats, organizations are increasingly adopting privacy-preserving techniques such as differential privacy (DP) that inject statistical noise when generating published datasets. These techniques are designed to protect privacy of data subjects while enabling useful analyses, but their reception by data users is under-explored. We developed documentation that presents the noise characteristics of two Wikipedia pageview datasets: one using rounding (heuristic privacy) and another using DP (formal privacy). After incorporating expert feedback (n=5), we used these documents to conduct a task-based contextual inquiry (n=15) exploring how data users--largely unfamiliar with these methods--perceive, interact with, and interpret privacy-preserving noise during data analysis. Participants readily used simple uncertainty metrics from the documentation, but struggled when asked to compute confidence intervals across multiple noisy estimates. They were better able to devise simulation-based approaches for computing uncertainty with DP data compared to rounded data. Surprisingly, several participants incorrectly believed DP's stronger utility implied weaker privacy protections. Based on our findings, we offer design recommendations for documentation and tools to better support data users working with privacy-noised data.
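
As an illustration of what a simulation-based approach to uncertainty across multiple noisy estimates might look like, here is a minimal sketch. It is not the authors' or WMF's method. It assumes each published count equals the true count plus independent zero-mean Laplace noise of known scale; the noise distribution, the `noise_scale` value, the example counts, and the helper names `laplace_noise` and `simulated_interval` are all hypothetical.

```python
import random


def laplace_noise(rng, scale):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def simulated_interval(noisy_counts, noise_scale, level=0.95, n_sims=20000, seed=0):
    """Monte Carlo interval for the sum of independently noised counts.

    Assumption (hypothetical): published = true + Laplace(0, noise_scale).
    """
    rng = random.Random(seed)
    k = len(noisy_counts)
    # Simulate the total noise that could have been added across all k counts.
    totals = sorted(
        sum(laplace_noise(rng, noise_scale) for _ in range(k))
        for _ in range(n_sims)
    )
    alpha = 1.0 - level
    lo_noise = totals[int((alpha / 2) * (n_sims - 1))]
    hi_noise = totals[int((1 - alpha / 2) * (n_sims - 1))]
    observed = sum(noisy_counts)
    # true_sum = observed - noise, so the noise quantiles swap ends.
    return observed - hi_noise, observed - lo_noise


# Hypothetical noisy pageview counts for three days.
low, high = simulated_interval([1200, 950, 1430], noise_scale=10.0)
print(f"approximate 95% interval for the true total: [{low:.1f}, {high:.1f}]")
```

The sketch relies on the noise distribution being fully specified, which is why the same approach does not carry over directly to rounded data, where the error depends on the unknown true value.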

Paper Structure

This paper contains 48 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Documentation for Pageviews Global Dataset, where pageviews are reported at a global level with no privacy protections applied.
  • Figure 2: Screenshots of documentation for Pageviews by Country Dataset, where pageviews are reported at a country level using DP.
  • Figure 3: Screenshots of documentation for Pageviews by Country Dataset, where pageviews are reported at a country level using Rounding.
  • Figure 4: Visual representations of the three tasks we asked participants to complete.
  • Figure 5: Visualization included in our documentation of the DP process used by WMF to report Wikipedia pageviews by country.
  • ...and 1 more figure