Towards more accurate and useful data anonymity vulnerability measures
Paul Francis, David Wagner
TL;DR
This work examines how data anonymity vulnerabilities in structured data are measured and criticizes prevalent evaluation practices. It introduces the non-member framework, which computes a privacy-neutral baseline by inferring attributes of individuals who are not in the dataset. This enables a principled comparison against attack performance using precision $P$ and coverage $C$ (or recall-like measures) and the derived metric $PI^{\mathbb{A}_i}$. By analyzing canonical attacks (e.g., US Census reconstruction, location traces, and ML-model inversions), the authors show that many claimed vulnerabilities rely on inappropriate baselines or unrealistic base rates, and argue that reports should include representative base rates to avoid overestimating risk. The paper also discusses GDPR alignment and the mitigation of dependence between members and non-members, and contrasts the non-member framework with prior work, arguing for more accurate, transparent risk assessment and utility-preserving anonymization practices. Overall, the proposed framework aims to calibrate vulnerability assessments, reduce the overstatement of risk, and provide practical guidance for reporting and policy. The work emphasizes the need for future tooling and broader empirical validation to operationalize these ideas across diverse data releases.
Abstract
The purpose of anonymizing structured data is to protect the privacy of individuals in the data while retaining the statistical properties of the data. There is a large body of work that examines anonymization vulnerabilities. Focusing on strong anonymization mechanisms, this paper examines a number of prominent attack papers and finds several problems, all of which lead to overstating risk. First, some papers fail to establish a correct statistical inference baseline (or any at all), leading to incorrect measures. Notably, the reconstruction attack from the US Census Bureau that led to a redesign of its disclosure method made this mistake. We propose the non-member framework, an improved method for computing a more accurate inference baseline, and give examples of its operation. Second, some papers don't use a realistic membership base rate, leading to incorrect precision measures if precision is reported. Third, some papers unnecessarily report measures in such a way that it is difficult or impossible to assess risk. Virtually the entire literature on membership inference attacks, dozens of papers, makes one or both of these last two errors. We propose that membership inference papers report precision/recall values using a representative range of base rates.
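To illustrate why the membership base rate matters for reported precision, the following minimal sketch applies the standard relationship between precision, true positive rate, false positive rate, and base rate (Bayes' rule). The function name `precision_at_base_rate` and the attack rates used here are hypothetical values chosen for illustration; they are not taken from any of the papers examined.

```python
def precision_at_base_rate(tpr: float, fpr: float, base_rate: float) -> float:
    """Precision of a membership inference attack at a given base rate.

    tpr: fraction of true members that the attack flags as members
    fpr: fraction of non-members that the attack flags as members
    base_rate: fraction of candidate individuals who are actually members
    """
    true_pos = tpr * base_rate            # members correctly flagged
    false_pos = fpr * (1.0 - base_rate)   # non-members wrongly flagged
    if true_pos + false_pos == 0.0:
        return float("nan")               # the attack makes no positive claims
    return true_pos / (true_pos + false_pos)


# Hypothetical attack: flags 50% of members and 1% of non-members.
TPR, FPR = 0.5, 0.01

# Report precision over a range of base rates rather than a single one.
for base_rate in (0.5, 0.1, 0.01, 0.001):
    p = precision_at_base_rate(TPR, FPR, base_rate)
    print(f"base rate {base_rate:>6}: precision {p:.3f}")
```

Under these assumed rates, the same attack that looks highly precise on a balanced test set (base rate 0.5) is wrong more often than right once members make up 1% or fewer of the candidates, which is why a single balanced base rate can overstate risk.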
