"Minus-One" Data Prediction Generates Synthetic Census Data with Good Crosstabulation Fidelity

William H. Press

TL;DR

This work presents Minus-One Data Prediction (MODP), a method for synthesizing census-like data: for each question, it learns a probabilistic predictor conditioned on the respondent's answers to all other questions, then samples synthetic responses from the predicted distributions. Using a non-self-predicting, multi-blade neural architecture, the approach achieves crosstabulation fidelity on STUMS (and STUMS-H) with median fractional errors around 5%, along with substantial plausible deniability due to row-wise entropy in the synthetic draws. The study shows that it is practical to generate synthetic census data that preserve two-way associations while offering some privacy protection, and it discusses limitations (e.g., structural zeros, lack of formal differential privacy) and scalability to larger PUMS datasets.

Abstract

We propose to capture relevant statistical associations in a dataset of categorical survey responses by a method, here termed MODP, that "learns" a probabilistic prediction function L. Specifically, L predicts each question's response based on the same respondent's answers to all the other questions. Draws from the resulting probability distribution become synthetic responses. Applying this methodology to the PUMS subset of Census ACS data, and with a learned L akin to multiple parallel logistic regression, we generate synthetic responses whose crosstabulations (two-point conditionals) are found to have a median accuracy of ~5% across all crosstabulation cells, with cell counts ranging over four orders of magnitude. We investigate and attempt to quantify the degree to which the privacy of the original data is protected.
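The minus-one idea described above can be illustrated with a small sketch. This is not the paper's learned L: as a stand-in it uses a smoothed naive-Bayes predictor built from pair counts, and it assumes each synthetic answer is drawn independently from the predicted distribution for its column. The function names (`fit_minus_one`, `predict_minus_one`, `synthesize`), the smoothing parameter `alpha`, and the toy rows are all hypothetical.

```python
import random
from collections import Counter, defaultdict

def fit_minus_one(rows, alpha=1.0):
    """Collect the counts needed to predict each column from all the others."""
    n_cols = len(rows[0])
    levels = [sorted({r[j] for r in rows}) for j in range(n_cols)]
    marginal = [Counter(r[j] for r in rows) for j in range(n_cols)]
    # pair[(j, k)][(vj, vk)] = rows with column j == vj and column k == vk
    pair = defaultdict(Counter)
    for r in rows:
        for j in range(n_cols):
            for k in range(n_cols):
                if j != k:
                    pair[(j, k)][(r[j], r[k])] += 1
    return {"levels": levels, "marginal": marginal, "pair": pair, "alpha": alpha}

def predict_minus_one(model, row, j):
    """P(answer to question j | answers to all OTHER questions), as a dict."""
    a = model["alpha"]
    scores = {}
    for v in model["levels"][j]:
        n_v = model["marginal"][j][v]
        score = n_v + a  # smoothed prior weight for answer v
        for k, x_k in enumerate(row):
            if k == j:
                continue  # "minus one": the row's own answer to j is never used
            n_levels_k = len(model["levels"][k])
            score *= (model["pair"][(j, k)][(v, x_k)] + a) / (n_v + a * n_levels_k)
        scores[v] = score
    total = sum(scores.values())
    return {v: s / total for v, s in scores.items()}

def synthesize(model, rows, rng):
    """Draw one synthetic row per input row, one column at a time."""
    synthetic = []
    for row in rows:
        new_row = []
        for j in range(len(row)):
            probs = predict_minus_one(model, row, j)
            values = list(probs)
            weights = [probs[v] for v in values]
            new_row.append(rng.choices(values, weights=weights)[0])
        synthetic.append(tuple(new_row))
    return synthetic

# Toy data: three categorical "questions" per respondent (made up for illustration).
rows = [("a", "x", "p"), ("a", "x", "q"), ("b", "y", "p"), ("b", "y", "q"), ("a", "y", "p")]
model = fit_minus_one(rows)
synthetic = synthesize(model, rows, random.Random(0))
```

Because the predictor for column j skips index j, changing a respondent's own answer to question j leaves the predicted distribution for that question unchanged, which is the property the "minus-one" name refers to.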
Paper Structure (20 sections, 12 equations, 17 figures, 4 tables)

Figures (17)

  • Figure 1: Changes in the univariate and crosstabular counts of the STUMS data set under bootstrap resampling [Efron81] of its 292,919 rows. In the main panel, the discrete lines of dots correspond to $0,1,2,\ldots$ counts in the true or synthetic data. (Zeros are plotted as 1 for visibility.) Achieving a similar scatter with respect to true data is a (likely unobtainable) goal of any method for producing synthetic data; it is the best we can hope for.
  • Figure 2: Univariate and crosstabular predictions of a primitive trained model with a single NonSelfPredictingLayer and loss function torch.nn.MSELoss. Without any direct knowledge of the data's crosstabulations, the model nevertheless reproduces crosstabulation counts to a considerable degree. Visible are this model's deficiencies in populating true structural zeros and cells with small numbers of true counts. More complex models, below, will do much better.
  • Figure 3: For a trained three-blade prediction model, the weights of individual records are shown as a triangular scatter plot. The concentration of occurrences at vertices and along edges, and the sparsity of points in the interior, suggest that the blades have learned to specialize in predicting different sub-populations of records.
  • Figure 4: Results for a 5-blade model trained with the loss function zval_loss_function. Other than small tweaks (see text), this is our best-performing model. (Compare to the hoped-for ideal in Figure 1.)
  • Figure 5: The model shown in Figure 4 is here tweaked to remove structural zeros. Performance is otherwise comparable.
  • ...and 12 more figures
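
The bootstrap baseline of Figure 1 can be sketched in a few lines. The example below resamples rows with replacement, recounts every two-way crosstabulation cell, and summarizes the scatter as a median fractional error. The error definition (|resampled − true| / true over cells with nonzero true count), the function names, and the random toy data are assumptions made for illustration, not the paper's exact procedure.

```python
import random
from collections import Counter
from itertools import combinations
from statistics import median

def crosstab_counts(rows):
    """Count every two-way cell: (col_i, col_j, value_i, value_j) -> number of rows."""
    counts = Counter()
    for r in rows:
        for i, j in combinations(range(len(r)), 2):
            counts[(i, j, r[i], r[j])] += 1
    return counts

def bootstrap_resample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return rng.choices(rows, k=len(rows))

def median_fractional_error(true_counts, other_counts):
    """Median of |other - true| / true over cells that are nonzero in the true data."""
    return median(abs(other_counts[c] - n) / n for c, n in true_counts.items())

# Toy stand-in for a survey: 2000 respondents, three categorical questions.
rng = random.Random(0)
rows = [(rng.randrange(2), rng.randrange(3), rng.randrange(2)) for _ in range(2000)]

true_counts = crosstab_counts(rows)
boot_counts = crosstab_counts(bootstrap_resample(rows, rng))
err = median_fractional_error(true_counts, boot_counts)
```

The same `median_fractional_error` call could be applied to counts from synthetic rows in place of `boot_counts`, which gives a like-for-like comparison against the resampling scatter.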