Evaluating Bias and Noise Induced by the U.S. Census Bureau's Privacy Protection Methods

Christopher T. Kenny, Cory McCartan, Shiro Kuriwaki, Tyler Simko, Kosuke Imai

TL;DR

The study addresses how the Census Bureau's privacy-protection methods (TopDown for 2020 and swapping in earlier censuses) affect bias and noise in published statistics. Using the Noisy Measurement File (NMF) and independent TopDown runs on 2010 data, the authors quantify average bias and root mean square error (RMSE) relative to the Census Edited File (CEF). They show that the NMF is too noisy for direct use, while TopDown post-processing substantially reduces variance to levels similar to swapping, with larger errors in small-population geographies. Across most geographies and racial groups, both methods yield near-unbiased counts, though Hispanic and multiracial groups exhibit higher RMSE, and off-spine geographies show more pronounced errors. The results imply that, for large geographies, privacy-induced errors are small relative to other census errors, while small geographies warrant caution. The paper also provides a framework and estimators for independent external evaluation of disclosure avoidance systems.

Abstract

The United States Census Bureau faces a difficult trade-off between the accuracy of Census statistics and the protection of individual information. We conduct the first independent evaluation of bias and noise induced by the Bureau's two main disclosure avoidance systems: the TopDown algorithm employed for the 2020 Census and the swapping algorithm implemented for the three previous Censuses. Our evaluation leverages the Noisy Measurement File (NMF) as well as two independent runs of the TopDown algorithm applied to the 2010 decennial Census. We find that the NMF contains too much noise to be directly useful, especially for Hispanic and multiracial populations. TopDown's post-processing dramatically reduces the NMF noise and produces data whose accuracy is similar to that of swapping. While the estimated errors for both TopDown and swapping algorithms are generally no greater than other sources of Census error, they can be relatively substantial for geographies with small total populations.

Paper Structure

This paper contains 24 sections, 4 theorems, 31 equations, 9 figures, 2 tables.

Key Result

Proposition 2.1

For a single run of the TopDown Algorithm, the following independence relations hold:

Figures (9)

  • Figure 1: The Census geographic hierarchy (spines). Spines for standard census geographies and for the hierarchy in the Noisy Measurement File (NMF). Higher units indicate enclosing units. In the right spine, sub-state geographic units are split into American Indian / Alaska Native (AI/AN) and non-AI/AN portions. The dotted arrow indicates that not all states have an AI/AN segment. The teal area indicates that the units can be matched one-to-one across spines. For example, a single block is never split into non-AI/AN and AI/AN fragments, so a block from the NMF spine matches 1:1 to a block in the standard spine.
  • Figure 2: Distribution of absolute count error in total population. The figure displays the absolute error in total population relative to the count enumerated in the Census Edited File (CEF). The y-axis is shown on a pseudo-log10 scale. Each panel depicts a geographic level, where a boxplot shows the nationwide distribution of population error at that level, with horizontal lines for the first quartile, median, and third quartile. Swapping errors are always zero for census geographies, as total population remains invariant. The Bureau-estimated mean absolute errors due to coverage and non-sampling error at the block level are indicated in the leftmost panel by horizontal lines. Coverage and non-sampling errors affect the CEF and thus are present in addition to DAS errors from TopDown, swapping, or noisy measurements.
  • Figure 3: Estimated root mean square error (RMSE) for population counts of a race/ethnicity group, at each geographic level. The RMSE quantifies the average magnitude of error for a geographic unit at a given geographic level (see Section \ref{sec-estimators} for estimators). Triangles for RMSE indicate that the estimated mean square error was negative and hence was set to zero.
  • Figure 4: Average bias for race/ethnicity population counts at each geographic level by its total population. The figure estimates the average overcounting or undercounting in a group of geographies, separately for five geographic levels and five race/ethnicity groups. See Section \ref{sec-estimators} for estimators. Bins on the y-axis are deciles of total population of the geographic level. Points show the estimated bias, and lines show estimated 95% confidence intervals.
  • Figure 5: Estimated root mean square error of race/ethnicity counts at each geographic level, by its total population. Unlike Figure \ref{fig-rmse-national}, it estimates RMSE for a subset of geographies. See Section \ref{sec-estimators} for estimators. Bins on the y-axis are deciles by total population of the geographic level.
  • ...and 4 more figures
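
The Figure 3 caption notes that an estimated mean square error can come out negative and is then set to zero. The sketch below illustrates how that can happen in a simplified setting; it is not the paper's estimator, whose definition is not reproduced on this page. It assumes, hypothetically, that each noisy measurement equals the unobserved true count plus zero-mean noise with known variance `noise_var`, and that this noise is independent of the error in the published count. The function name and arguments are illustrative only.

```python
import math


def estimate_bias_and_rmse(published, noisy, noise_var):
    """Estimate average bias and RMSE of published counts against an unobserved truth.

    Hypothetical assumptions (not taken from the paper):
      noisy[i] = truth[i] + e[i], with E[e] = 0 and Var(e) = noise_var,
      and e[i] independent of the error published[i] - truth[i].
    """
    n = len(published)
    diffs = [p - z for p, z in zip(published, noisy)]

    # The noise has mean zero, so the mean difference estimates the average bias.
    bias = sum(diffs) / n

    # Under independence, E[(p - z)^2] = E[(p - truth)^2] + noise_var,
    # so subtracting the known noise variance gives an unbiased MSE estimate.
    mse = sum(d * d for d in diffs) / n - noise_var

    # The subtraction can push the estimate below zero in finite samples;
    # in that case it is truncated at zero and flagged.
    was_negative = mse < 0
    rmse = math.sqrt(max(mse, 0.0))
    return bias, rmse, was_negative
```

For example, `estimate_bias_and_rmse([10, 12, 9], [10, 12, 9], 4.0)` has zero observed differences, so the variance-corrected estimate is negative and the returned RMSE is truncated to zero with the flag set.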

Theorems & Definitions (8)

  • Proposition 2.1
  • Proposition 2.2
  • Proposition 2.3
  • Proposition 2.4
  • proof
  • proof
  • proof
  • proof