An applied Perspective: Estimating the Differential Identifiability Risk of an Exemplary SOEP Data Set

Jonas Allmann, Saskia Nuñez von Voigt, Florian Tschorsch

TL;DR

This paper investigates how to quantify identifiability risk under differential privacy for a real-world, large-scale data set (SOEP). It extends the Lee and Clifton risk framework by incorporating local and global sensitivities and introduces a new two-worlds risk metric, which it compares with the existing many-worlds metric and a simplified worst-case bound. In an empirical evaluation on the SOEP data with multiple query types and privacy levels, the authors show that risk estimates depend strongly on data characteristics, query type, and sample size, and that the two-worlds metric gives a more conservative and robust assessment than the many-worlds metric in practice. The work highlights practical challenges in communicating DP risks, suggests risk budgeting and system comparisons as viable approaches, and points to directions for more granular, data-aware risk calibration and explanation formats for end users.

Abstract

Using real-world study data usually requires contractual agreements where research results may only be published in anonymized form. Requiring formal privacy guarantees, such as differential privacy, could be helpful for data-driven projects to comply with data protection. However, deploying differential privacy in consumer use cases raises the need to explain its underlying mechanisms and the resulting privacy guarantees. In this paper, we thoroughly review and extend an existing privacy metric. We show how to compute this risk metric efficiently for a set of basic statistical queries. Our empirical analysis based on an extensive, real-world scientific data set expands the knowledge on how to compute risks under realistic conditions, while presenting more challenges than solutions.

Paper Structure (19 sections, 6 equations, 2 figures, 2 tables)

Figures (2)

  • Figure 1: Comparison of many-worlds risk and two-worlds risk with $\varepsilon=1$.
  • Figure 2: Risks with varying sample proportion and $\varepsilon$ for max query and variable distance work.