Practical privacy metrics for synthetic data
Gillian M Raab, Beata Nowok, Chris Dibben
TL;DR
The paper addresses the lack of standardized disclosure-risk metrics for synthetic data by extending the synthpop R package with tools to quantify identity and attribute disclosure risks for records used to create synthetic data. It introduces two core measures, RepU for identity disclosure and DiSCO for attribute disclosure, along with related metrics (UiO, UiS, UiOiS, Dorig, DCAP, TCAP) to enable comprehensive risk assessment and comparisons against the original data. Through illustrative examples, it demonstrates how to identify risky one-way and two-way relationships and how exclusions can mitigate apparent disclosures, highlighting the importance of context and data structure in privacy evaluation. The proposed approach offers practical methods for data custodians to evaluate and manage disclosure risk when releasing synthetic data, complementing broader privacy frameworks while emphasizing user-driven specification of keys and intruder knowledge.
Abstract
This paper explains how the synthpop package for R has been extended to include functions that calculate measures of identity and attribute disclosure risk for synthetic data; the measures quantify risks for the records used to create the synthetic data. The basic function, disclosure, calculates identity disclosure for a set of quasi-identifiers (keys) and attribute disclosure for one variable, specified as a target, from the same set of keys. The second function, disclosure.summary, is a wrapper for the first and presents summary results for a set of targets. This short paper explains the measures of disclosure risk and documents how they are calculated. We recommend two measures: $RepU$ (replicated uniques) for identity disclosure and $DiSCO$ (Disclosive in Synthetic Correct Original) for attribute disclosure. Both are expressed as a \% of the original records, and each can be compared to similar measures calculated from the original data. Experience using the functions on real data showed that some apparent disclosures could be traced to relationships in the data that would be expected to be known to anyone familiar with its features. We flag cases where this seems to have occurred and provide means of excluding them.
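The workflow the abstract describes can be sketched in R as follows. This is an illustrative sketch only: the key and target choices are arbitrary, and the exact argument defaults should be checked against the synthpop documentation.

```r
# Sketch of the disclosure-assessment workflow described above.
# SD2011 is the example survey dataset shipped with synthpop;
# the choice of keys and target here is purely illustrative.
library(synthpop)

ods <- SD2011[, c("sex", "age", "region", "marital", "income")]  # original data
sds <- syn(ods, seed = 2024)                                     # synthetic data

# Identity and attribute disclosure for one target from a set of keys;
# the output includes repU (replicated uniques) and DiSCO, along with
# the related measures (UiO, UiS, UiOiS, DCAP, TCAP, ...).
disc <- disclosure(sds, ods,
                   keys = c("sex", "age", "region"),
                   target = "income")
print(disc)

# Wrapper giving summary results for a set of targets
disclosure.summary(sds, ods, keys = c("sex", "age", "region"))
```

Both functions take the synthesis object and the original data, so the reported risks can be compared directly with the equivalent measures computed from the original data, as the abstract recommends.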
