A Refreshment Stirred, Not Shaken (III): Can Swapping Be Differentially Private?

James Bailie, Ruobin Gong, Xiao-Li Meng

TL;DR

The paper develops a unified five-building-block framework for differential privacy (DP) specifications, consisting of the domain $\mathcal{X}$, multiverse $\mathscr{D}$, input premetric $d_{\mathcal{X}}$, output premetric $D_{\Pr}$, and protection loss budget $\varepsilon_{\mathcal{D}}$, and presents a Lipschitz-type condition $D_{\Pr}(\mathsf P_{\bm x},\mathsf P_{\bm x'}) \le \varepsilon_{\mathcal{D}}\, d_{\mathcal{X}}(\bm x, \bm x')$ to unify DP flavors. It applies this framework to the US Census, contrasting the data swapping used in 2010 with the TopDown Algorithm (TDA) used in 2020, showing that swapping can be DP when invariants are accounted for, and that the 2010 and 2020 disclosure avoidance methods correspond to different DP specifications with distinct protection units and invariants. The paper argues that DP and traditional statistical disclosure control (SDC) can be reconciled to reap the strengths of both, while highlighting the risks and tradeoffs introduced by invariants, transparency, and epistemic uncertainty. It also discusses practical strategies to mitigate invariant-induced risks, including probabilistic matching and pre/post-swap perturbations, and contemplates extending DP to embrace epistemic uncertainty via imprecise probabilities. Overall, the work provides a rigorous, compositional lens for understanding privacy-utility tradeoffs in large-scale releases like censuses, and it clarifies when swapping can be considered DP within an expanded specification framework.

Abstract

The quest for a precise and contextually grounded answer to the question in the present paper's title resulted in this stirred-not-shaken triptych, a phrase that reflects our desire to deepen the theoretical basis, broaden the practical applicability, and reduce the misperception of differential privacy (DP), all without shaking its core foundations. Indeed, given the existence of more than 200 formulations of DP (and counting), before even attempting to answer the titular question one must first precisely specify what it actually means to be DP. Motivated by this observation, a theoretical investigation into DP's fundamental essence resulted in Part I of this trio, which introduces a five-building-block system explicating the who, where, what, how and how much aspects of DP. Instantiating this system in the context of the United States Decennial Census, Part II then demonstrates the broader applicability and relevance of DP by comparing a swapping strategy like that used in 2010 with the TopDown Algorithm, a DP method adopted in the 2020 Census. This paper provides nontechnical summaries of the preceding two parts as well as new discussion, for example on how greater awareness of the five building blocks can thwart privacy theatrics; how our results bridging traditional SDC and DP allow a data custodian to reap the benefits of both these fields; how invariants impact disclosure risk; and how removing the implicit reliance on aleatoric uncertainty could lead to new generalizations of DP.

Paper Structure

This paper contains 16 sections, 3 equations, 1 figure.

Figures (1)

  • Figure 1: Schematic of a differential privacy specification $\varepsilon_{\mathcal{D}}$-DP$(\mathcal{X}, \mathscr{D}, d_{\mathcal{X}}, D_{\Pr})$. The domain $\mathcal{X}$ is the set of all possible datasets (be they actual, potential or counterfactual). We denote two arbitrary datasets by $\bm{x}$ and $\bm{x}'$; other possible datasets are depicted by gray circles. The multiverse $\mathscr{D} = \{ \mathcal{D}_1, \mathcal{D}_2, \mathcal{D}_3, \mathcal{D}_4, \mathcal{D}_5\}$ is a collection of sets of datasets; these sets are called universes. (In this schematic, $\mathscr{D}$ partitions the domain $\mathcal{X}$, as would happen when $\mathscr{D}$ encodes invariants. In general, this need not be the case; the universes often overlap.) A data release mechanism $T$ transforms a dataset $\bm{x}$ to a random output $T(\bm{x})$, which is a draw from the probability distribution $\mathsf{P}_{\bm{x}}$. Intuitively, differential privacy requires that similar datasets $\bm{x}$ and $\bm{x}'$ have similar output distributions $\mathsf{P}_{\bm{x}}$ and $\mathsf{P}_{\bm{x}'}$. This is formalized by the Lipschitz condition $D_{\Pr}(\mathsf{P}_{\bm{x}}, \mathsf{P}_{\bm{x}'}) \le \varepsilon_{\mathcal{D}_1} d_{\mathcal{X}}(\bm{x}, \bm{x}')$, which states that the 'distance' $D_{\Pr}(\mathsf{P}_{\bm{x}}, \mathsf{P}_{\bm{x}'})$ between the output distributions is at most a constant multiple $\varepsilon_{\mathcal{D}_1}$ of the 'distance' $d_{\mathcal{X}}(\bm{x}, \bm{x}')$ between the corresponding input datasets.
Here, similarity (or 'distance') between datasets is measured by the DP specification's input premetric $d_{\mathcal{X}}$, visualized above as a caliper, and similarity between probability distributions of the output under different inputs is measured by the DP specification's output premetric $D_{\Pr}$ (the tape measure). For simplicity, we depict the output space $\mathcal{T}$ as one-dimensional, although in practice it is frequently a high-dimensional space, or even a union of many different probability spaces (as is the case for local DP). (The PLB above, $\varepsilon_{\mathcal{D}_1}$, has the subscript $\mathcal{D}_1$ because the Lipschitz condition is applied to the datasets $\bm{x}$ and $\bm{x}'$, which are members of the universe $\mathcal{D}_1$, and because the PLB is allowed to vary between universes, potentially taking up to five different values, $\varepsilon_{\mathcal{D}_1}, \varepsilon_{\mathcal{D}_2}, \ldots, \varepsilon_{\mathcal{D}_5}$.)
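
The Lipschitz condition in the Figure 1 caption can be checked by brute force on a toy example. The sketch below is only an illustration under stated assumptions, not the paper's method: the mechanism is per-record randomized response on 3-bit datasets, the input premetric is taken to be Hamming distance, the output premetric is taken to be the largest absolute log-ratio of outcome probabilities, and the universes group datasets by their number of ones (a toy invariant). All function and variable names are hypothetical.

```python
import itertools
import math


def randomized_response(x, eps):
    """Exact output distribution P_x when each bit of x is reported
    truthfully with probability e^eps / (1 + e^eps), independently."""
    keep = math.exp(eps) / (1.0 + math.exp(eps))
    dist = {}
    for t in itertools.product((0, 1), repeat=len(x)):
        prob = 1.0
        for xi, ti in zip(x, t):
            prob *= keep if xi == ti else 1.0 - keep
        dist[t] = prob
    return dist


def hamming(x, y):
    """Input premetric d_X (assumed): number of records that differ."""
    return sum(a != b for a, b in zip(x, y))


def mult_distance(p, q):
    """Output premetric D_Pr (assumed): largest absolute log-ratio of
    outcome probabilities; both distributions have full support here."""
    return max(abs(math.log(p[t] / q[t])) for t in p)


def satisfies_spec(multiverse, budget, mechanism, d_in, d_out, tol=1e-9):
    """Check D_Pr(P_x, P_x') <= eps_D * d_X(x, x') for every pair of
    datasets x, x' that share a universe D."""
    for label, universe in multiverse.items():
        for x, y in itertools.combinations(universe, 2):
            lhs = d_out(mechanism(x), mechanism(y))
            if lhs > budget[label] * d_in(x, y) + tol:
                return False
    return True


EPS = 1.0
domain = list(itertools.product((0, 1), repeat=3))
# Universes group datasets with the same number of ones (a toy invariant).
multiverse = {s: [x for x in domain if sum(x) == s] for s in range(4)}


def mechanism(x):
    return randomized_response(x, EPS)


full_budget = {s: EPS for s in multiverse}      # one PLB per universe
half_budget = {s: EPS / 2 for s in multiverse}  # a budget that is too small

print(satisfies_spec(multiverse, full_budget, mechanism, hamming, mult_distance))
print(satisfies_spec(multiverse, half_budget, mechanism, hamming, mult_distance))
```

Because the check only compares datasets within the same universe, pairs that disagree on the invariant are never compared. The budget is stored per universe to mirror the caption's remark that the PLB may vary between universes.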