Towards more accurate and useful data anonymity vulnerability measures

Paul Francis, David Wagner

TL;DR

This work examines how anonymity vulnerabilities in structured data are measured and critiques prevalent evaluation practices. It introduces the non-member framework, which computes a privacy-neutral baseline by inferring attributes of individuals who are not in the dataset. Attack performance can then be compared against this baseline using precision $P$ and coverage $C$ (or recall-like measures) and the derived metric $PI^{\mathbb{A}_i}$. By analyzing canonical attacks (e.g., US Census reconstruction, location traces, and ML-model inversions), the authors show that many claimed vulnerabilities rely on inappropriate baselines or unrealistic base rates, and they argue that reports should include representative base rates to avoid overestimating risk. The paper also discusses GDPR alignment and the mitigation of dependence between members and non-members, and it contrasts the non-member framework with prior work. Overall, the framework aims to calibrate vulnerability assessments, reduce the overstatement of risk, support utility-preserving anonymization, and give practical guidance for reporting and policy. The work notes that future tooling and broader empirical validation are needed to apply these ideas across diverse data releases.

Abstract

The purpose of anonymizing structured data is to protect the privacy of individuals in the data while retaining the statistical properties of the data. There is a large body of work that examines anonymization vulnerabilities. Focusing on strong anonymization mechanisms, this paper examines a number of prominent attack papers and finds several problems, all of which lead to overstating risk. First, some papers fail to establish a correct statistical inference baseline (or any at all), leading to incorrect measures. Notably, the reconstruction attack from the US Census Bureau that led to a redesign of its disclosure method made this mistake. We propose the non-member framework, an improved method for computing a more accurate inference baseline, and give examples of its operation. Second, some papers don't use a realistic membership base rate, leading to incorrect precision measures if precision is reported. Third, some papers unnecessarily report measures in such a way that it is difficult or impossible to assess risk. Virtually the entire literature on membership inference attacks, dozens of papers, makes one or both of these last two errors. We propose that membership inference papers report precision/recall values using a representative range of base rates.
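
To see why the assumed membership base rate matters for precision, a minimal sketch follows. It applies Bayes' rule to convert an attack's true-positive rate (recall) and false-positive rate into precision at several base rates. The function name and the example rates (TPR 0.5, FPR 0.01) are illustrative assumptions and are not taken from the paper.

```python
def precision_at_base_rate(tpr: float, fpr: float, base_rate: float) -> float:
    """Precision of a membership inference attack at a given base rate.

    base_rate is the fraction of candidate individuals who are truly members.
    """
    true_pos = tpr * base_rate            # members correctly flagged
    false_pos = fpr * (1.0 - base_rate)   # non-members wrongly flagged
    if true_pos + false_pos == 0.0:
        return float("nan")  # the attack makes no positive predictions
    return true_pos / (true_pos + false_pos)


# Hypothetical attack: TPR = 0.5, FPR = 0.01 (illustrative values only).
BASE_RATES = [0.5, 0.1, 0.01, 0.001]
report = {b: precision_at_base_rate(0.5, 0.01, b) for b in BASE_RATES}

for b, p in report.items():
    print(f"base rate {b:>6}: precision {p:.3f}")
```

With a balanced base rate of 0.5, this hypothetical attack shows a precision of about 0.98, while at a base rate of 0.001 the same attack falls below 0.05. Recall is unchanged across base rates, which is why reporting precision at a single, favorable base rate can overstate risk.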

Paper Structure (21 sections, 6 equations, 10 figures)

Figures (10)

  • Figure 1: The US Census attack compared to simple inference from the per-block majority race and ethnicity [francis2022census].
  • Figure 2: The non-member framework for computing allowed baselines. For any given combination of known attributes and secret attributes, the analysis that produces the best precision and coverage is used as the allowed baseline.
  • Figure 3: There are 6 categorical and 14 continuous attributes in BankChurners. Five attributes are PII.
  • Figure 4: Results for baseline precision achieved by simple ML predictions. The X axis is the target attribute. The features are either all attributes except the target, or only PII attributes. Prediction rate $PR_{base} = 1.0$.
  • Figure 5: Precision $P_{base}$ versus prediction rate $PR_{base}$ on the categorical variables of BankChurners. Each point for a given secret attribute represents a different cutoff threshold for making a prediction (versus making no prediction). Four of the six secret attributes achieve perfect precision.
  • ...and 5 more figures
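
The trade-off described in the Figure 5 caption, where each point corresponds to a different cutoff threshold, can be sketched as follows. This is a minimal illustration, assuming that every prediction carries a confidence score and that predictions below the cutoff are withheld. The function name and the toy data are hypothetical and do not come from BankChurners.

```python
from typing import List, Tuple


def precision_vs_prediction_rate(
    scored: List[Tuple[float, bool]], thresholds: List[float]
) -> List[Tuple[float, float, float]]:
    """For each cutoff, return (threshold, prediction_rate, precision).

    scored holds (confidence, was_correct) for every target individual.
    A prediction is made only when confidence >= threshold.
    """
    curve = []
    total = len(scored)
    for t in thresholds:
        made = [correct for conf, correct in scored if conf >= t]
        if not made:
            continue  # no predictions at this cutoff, precision undefined
        prediction_rate = len(made) / total
        precision = sum(made) / len(made)
        curve.append((t, prediction_rate, precision))
    return curve


# Toy data (hypothetical): confidence score and whether the guess was right.
toy = [(0.95, True), (0.9, True), (0.8, True), (0.7, False),
       (0.6, True), (0.55, False), (0.5, False), (0.5, True)]
curve = precision_vs_prediction_rate(toy, [0.5, 0.7, 0.9])

for t, pr, p in curve:
    print(f"cutoff {t:.2f}: prediction rate {pr:.2f}, precision {p:.2f}")
```

On this toy data, raising the cutoff from 0.5 to 0.9 lowers the prediction rate from 1.0 to 0.25 while precision rises from 0.625 to 1.0, so a baseline can reach high precision on a small subset of individuals simply by declining to predict on the rest.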