"Minus-One" Data Prediction Generates Synthetic Census Data with Good Crosstabulation Fidelity
William H. Press
TL;DR
This work presents Minus-One Data Prediction (MODP), a method for synthesizing census-like data: for each question, it learns a probabilistic predictor conditioned on the same respondent's answers to all other questions, then samples synthetic responses from the predicted distributions. Using a non-self-predicting, multi-blade neural architecture, the approach achieves good crosstabulation fidelity on STUMS (and STUMS-H), with median fractional errors around 5%, and offers plausible deniability because of the row-wise entropy in the synthetic draws. The study shows that it is practical to generate synthetic census data that preserve two-way associations while offering some privacy protection, and it discusses limitations (e.g., structural zeros, lack of formal differential privacy) and scalability to larger PUMS datasets. Overall, MODP is a modular framework for producing synthetic categorical data with good empirical fidelity to the observed joint distributions, supported by accuracy and privacy analyses.
Abstract
We propose to capture relevant statistical associations in a dataset of categorical survey responses by a method, here termed "Minus-One" Data Prediction (MODP), that "learns" a probabilistic prediction function L. Specifically, L predicts each question's response based on the same respondent's answers to all the other questions. Draws from the resulting probability distribution become synthetic responses. Applying this methodology to the PUMS subset of Census ACS data, and with a learned L akin to multiple parallel logistic regression, we generate synthetic responses whose crosstabulations (two-point conditionals) are found to have a median fractional error of ~5% across all crosstabulation cells, with cell counts ranging over four orders of magnitude. We investigate and attempt to quantify the degree to which the privacy of the original data is protected.
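The predict-then-draw loop described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the three-question toy dataset, the function names (`fit_counts`, `predict`, `synthesize`, `median_fractional_error`), and above all the predictor, which is a Laplace-smoothed naive-Bayes stand-in rather than the learned L of the paper. The only property it is meant to share with MODP is that the prediction for a question never reads the respondent's own answer to that question.

```python
import random
import statistics
from collections import Counter


def fit_counts(rows, n_levels):
    """Tabulate one-way and two-way counts used by the stand-in predictor."""
    n_q = len(n_levels)
    single = [Counter(r[q] for r in rows) for q in range(n_q)]
    pair = Counter()
    for r in rows:
        for q in range(n_q):
            for k in range(n_q):
                if k != q:
                    pair[(q, r[q], k, r[k])] += 1
    return single, pair


def predict(row, q, single, pair, n_levels, alpha=1.0):
    """Return P(answer to q | all OTHER answers in row); row[q] is never read."""
    weights = []
    for a in range(n_levels[q]):
        w = single[q][a] + alpha
        for k, b in enumerate(row):
            if k == q:
                continue  # the "minus-one" rule: skip the question being predicted
            # Naive-Bayes factor P(row[k] = b | answer_q = a), Laplace-smoothed.
            w *= (pair[(q, a, k, b)] + alpha) / (single[q][a] + alpha * n_levels[k])
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]


def synthesize(rows, n_levels, rng):
    """For each real row, draw every answer from its minus-one prediction."""
    single, pair = fit_counts(rows, n_levels)
    synth = []
    for row in rows:
        new_row = []
        for q in range(len(n_levels)):
            probs = predict(row, q, single, pair, n_levels)
            new_row.append(rng.choices(range(n_levels[q]), weights=probs)[0])
        synth.append(tuple(new_row))
    return synth


def median_fractional_error(real, synth, n_q):
    """Median of |synthetic - real| / real over all nonempty two-way cells."""
    errs = []
    for q1 in range(n_q):
        for q2 in range(q1 + 1, n_q):
            real_ct = Counter((r[q1], r[q2]) for r in real)
            synth_ct = Counter((r[q1], r[q2]) for r in synth)
            errs += [abs(synth_ct[c] - n) / n for c, n in real_ct.items()]
    return statistics.median(errs)


# Hypothetical toy survey: question 1 usually repeats question 0;
# question 2 is independent of both.
rng = random.Random(0)
n_levels = [2, 2, 3]
rows = []
for _ in range(2000):
    a = rng.randrange(2)
    b = a if rng.random() < 0.9 else 1 - a
    rows.append((a, b, rng.randrange(3)))

synth = synthesize(rows, n_levels, rng)
err = median_fractional_error(rows, synth, len(n_levels))
print(f"median fractional crosstab error (toy): {err:.3f}")
```

With only three questions and a crude predictor, this toy should not be expected to reproduce the ~5% figure reported above; it is meant only to make the data flow concrete: fit on the real rows, predict each answer from the others, draw, then compare two-way tables cell by cell.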
