Noisy Measurements Are Important, the Design of Census Products Is Much More Important
John M. Abowd
TL;DR
This commentary argues that making differential privacy work for census data users depends more on the design of the official publication products (the query workload) than on the raw Noisy Measurement Files (NMFs). It analyzes the 2020 redistricting data within the differential privacy framework, showing that multiple policy constraints inflated the query strategy to $16$ billion statistics, far beyond the $1.5$ billion in the workload, and that post-processing to enforce non-negativity introduces bias. The author advocates relaxing constraints or reconfiguring publication formats to reduce noise and improve uncertainty quantification, citing examples such as Detailed DHC-A, and calls for funded, collaborative tool development. The commentary has practical implications for future censuses: it suggests ways to balance confidentiality against the needs of redistricting and other statutory uses while maintaining public trust and supporting robust statistical inference.
Abstract
McCartan et al. (2023) call for "making differential privacy work for census data users." This commentary explains why the 2020 Census Noisy Measurement Files (NMFs) are not the best focus for that plea. The August 2021 letter from 62 prominent researchers asking for production of the direct output of the differential privacy system deployed for the 2020 Census signaled the engagement of the scholarly community in the design of decennial census data products. NMFs, the raw statistics produced by the 2020 Census Disclosure Avoidance System before any post-processing, are one component of that design: the query strategy output. The more important component is the query workload output, that is, the statistics released to the public. Optimizing the query workload, specifically the Redistricting Data (P.L. 94-171) Summary File, could allow the privacy-loss budget to be more effectively managed. There could be fewer noisy measurements, no post-processing bias, and direct estimates of the uncertainty from disclosure avoidance for each published statistic.
