Evaluating the Impacts of Swapping on the US Decennial Census

Maria Ballesteros, Cynthia Dwork, Gary King, Conlan Olson, Manish Raghavan

TL;DR

This paper probes how data swapping, a historical Census disclosure-avoidance method, impacts data utility relative to TopDown differential privacy. It builds a parameterized swapping model from public Census sources and synthetic microdata to enable controlled comparisons with TopDown and ToyDown across multiple states and swap rates. The study finds that swapping increases tract-level racial entropy and biases statistics in directions opposite to those induced by DP, while the variance is often lower than TopDown at realistic swap rates, and it provides a general framework to estimate and potentially debias downstream analyses. These findings highlight that swapping is not a one-to-one replacement for differential privacy and emphasize the need for principled tools to account for swapping-induced effects in downstream research and policy analyses.

Abstract

To meet its dual burdens of providing useful statistics and ensuring privacy of individual respondents, the US Census Bureau has for decades introduced some form of "noise" into published statistics. Initially, it used a method known as "swapping" (1990-2010). In 2020, it switched to an algorithm called TopDown that ensures a form of Differential Privacy. While the TopDown algorithm has been made public, no implementation of swapping has been released and many details of the deployed swapping methodology have been kept secret. Further, the Bureau has not published an "original" dataset (even a synthetic one) together with its swapped version. It is therefore difficult to evaluate the effects of swapping, and to compare these effects to those of other privacy technologies. To address these difficulties we describe and implement a parameterized swapping algorithm based on Census publications, court documents, and informal interviews with Census employees. With this implementation, we characterize the impacts of swapping on a range of statistical quantities of interest. We provide intuition for the types of shifts induced by swapping and compare against those introduced by TopDown. We find that even when swapping and TopDown introduce errors of similar magnitude, the direction in which statistics are biased need not be the same across the two techniques. More broadly, our implementation provides researchers with the tools to analyze and potentially correct for the impacts of disclosure avoidance systems on the quantities they study.

Paper Structure

This paper contains 24 sections, 11 equations, 21 figures, 9 tables.

Figures (21)

  • Figure 1: Various versions of the Census data and the relationships between them. Note that there are two releases of $\mathbf{X}_\textnormal{TopDown}$, which we call $\mathbf{X}_\textnormal{TopDown2021}$ and $\mathbf{X}_\textnormal{TopDown2023}$. We also produce multiple versions of $\mathbf{X}_{\textnormal{Swapped}}$: $\mathbf{X}_{\textnormal{Swapped2}}$ is produced by swapping at a 2% swap rate and $\mathbf{X}_{\textnormal{Swapped10}}$ is produced by swapping at a 10% swap rate. The red arrows show the comparisons made in the panels in Figures \ref{fig:swapping_errors} and \ref{fig:relative_swapping_errors}. The blue arrow shows the comparison of interest (which we cannot directly observe).
  • Figure 2: Plots of the four quantities $\textnormal{Error}_\gamma^r(\mathbf{X}_{\textnormal{Synthetic}},\mathbf{X}_{\textnormal{Swapped2}})$, $\textnormal{Error}_\gamma^r(\mathbf{X}_{\textnormal{Synthetic}},\mathbf{X}_{\textnormal{Swapped10}})$, $\textnormal{Error}_\gamma^r(\mathbf{X}_{\textnormal{Synthetic}},\mathbf{X}_{\textnormal{ToyDown}})$, and $\textnormal{Error}_\gamma^r(\mathbf{X}_{\textnormal{Released}},\mathbf{X}_{\textnormal{TopDown2021}})$ for all $\gamma\in\mathcal{C}_{\textnormal{Alabama}}$, where $\mathcal{C}_{\textnormal{Alabama}}$ is the set of all counties in Alabama. To generate the "Average" plot, for each county, the errors for each race were averaged. Note that the scales of the $y$-axes differ across plots.
  • Figure 3: Plots of the four quantities $\textnormal{RelativeError}_\gamma^r(\mathbf{X}_{\textnormal{Synthetic}},\mathbf{X}_{\textnormal{Swapped2}})$, $\textnormal{RelativeError}_\gamma^r(\mathbf{X}_{\textnormal{Synthetic}},\mathbf{X}_{\textnormal{Swapped10}})$, $\textnormal{RelativeError}_\gamma^r(\mathbf{X}_{\textnormal{Synthetic}},\mathbf{X}_{\textnormal{ToyDown}})$, and $\textnormal{RelativeError}_\gamma^r(\mathbf{X}_{\textnormal{Released}},\mathbf{X}_{\textnormal{TopDown2021}})$ for all $\gamma\in\mathcal{C}_{\textnormal{Alabama}}$, where $\mathcal{C}_{\textnormal{Alabama}}$ is the set of all counties in Alabama. To generate the "Average" plot, for each county, the relative errors for each race were averaged.
  • Figure 4: The variance of swapping at different swap rates. Error bars show the minimum and maximum of 5 runs of our estimator. The horizontal dotted lines show the variance estimate for TopDown. Note that we do not characterize the error in our estimate for TopDown. Plots for two more states appear in Figure \ref{fig:variance_plots_vt_nv}.
  • Figure 5: The effect of swapping on racial entropy at the tract level for a 10% swap rate in Alabama. The left panel shows the effect of each step of swapping on entropy. The right panel shows the overall effect. Corresponding plots for 2% and 10% swap rates in other states appear in Figures \ref{fig:entropy_al}, \ref{fig:entropy_wi}, \ref{fig:entropy_tx}, \ref{fig:entropy_nv}, and \ref{fig:entropy_vt} in Appendix \ref{appendix:additional_tables_figures}.
  • ...and 16 more figures
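
To give some intuition for the tract-level racial entropy reported in Figure 5, the following is a minimal sketch on made-up data. It assumes that "racial entropy" means the Shannon entropy of the race distribution within a tract, and it uses a hypothetical `toy_swap` that exchanges a random fraction of records between two tracts. This is not the paper's parameterized swapping algorithm, whose details (such as how records are selected and matched) are not reproduced here; the function names, the two-tract setup, and the race labels are all illustrative assumptions.

```python
import math
import random
from collections import Counter


def racial_entropy(races):
    """Shannon entropy (in bits) of the race labels in one tract."""
    counts = Counter(races)
    total = sum(counts.values())
    # p * log2(1/p), summed over the races present in the tract
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def toy_swap(tract_a, tract_b, swap_rate, rng):
    """Exchange a fraction of records between two tracts (hypothetical toy swap).

    Tract sizes and the combined race counts are unchanged; only the
    location of the swapped records changes.
    """
    k = round(swap_rate * min(len(tract_a), len(tract_b)))
    new_a, new_b = list(tract_a), list(tract_b)
    idx_a = rng.sample(range(len(new_a)), k)  # distinct records leaving tract A
    idx_b = rng.sample(range(len(new_b)), k)  # distinct records leaving tract B
    for i, j in zip(idx_a, idx_b):
        new_a[i], new_b[j] = new_b[j], new_a[i]
    return new_a, new_b


# Two fully segregated synthetic tracts (made-up data, not Census data).
tract_a = ["White"] * 50
tract_b = ["Black"] * 50
rng = random.Random(0)
swapped_a, swapped_b = toy_swap(tract_a, tract_b, swap_rate=0.10, rng=rng)

print("entropy before:", racial_entropy(tract_a))
print("entropy after: ", racial_entropy(swapped_a))
```

In this deliberately extreme example every swapped record moves a minority label into a homogeneous tract, so entropy can only rise. In an already mixed tract, the change depends on which records are exchanged.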