Predicting Census Survey Response Rates With Parsimonious Additive Models and Structured Interactions

Shibal Ibrahim, Peter Radchenko, Emanuel Ben-David, Rahul Mazumder

TL;DR

This work addresses predicting census self-response rates with interpretable, sparse nonparametric additive models that include nonlinear main effects and pairwise interactions. The authors develop ELAAN-I and the hierarchy-enforcing ELAAN-H, leveraging an $\ell_0$-penalty and scalable block coordinate descent (with active sets) and MIP formulations to handle large-scale data ($n \sim 10^5$, $p \sim \text{hundreds}$). They provide nonasymptotic statistical guarantees and demonstrate, on the US Census Planning Database, that these models achieve predictive accuracy on par with black-box methods while using far fewer components, enabling interpretable insights and operational actions for targeted outreach. The case study reveals interactions that map to known census clusters and mindsets, illustrating the framework's potential to guide resource allocation and survey design in practice.

Abstract

In this paper, we consider the problem of predicting survey response rates using a family of flexible and interpretable nonparametric models. The study is motivated by the US Census Bureau's well-known ROAM application, which uses a linear regression model trained on the US Census Planning Database to identify hard-to-survey areas. A crowdsourcing competition (Erdman and Bates, 2016) organized more than ten years ago revealed that machine learning methods based on ensembles of regression trees led to the best performance in predicting survey response rates; however, the corresponding models could not be adopted for the intended application due to their black-box nature. We consider nonparametric additive models with a small number of main and pairwise interaction effects using $\ell_0$-based penalization. From a methodological viewpoint, we study our estimator's computational and statistical aspects and discuss variants incorporating strong hierarchical interactions. Our algorithms (open-sourced on GitHub) extend the computational frontiers of existing algorithms for sparse additive models so that they can handle datasets relevant to the application we consider. We discuss and interpret findings from our model on the US Census Planning Database. In addition to being useful from an interpretability standpoint, our models lead to predictions comparable to popular black-box machine learning methods based on gradient boosting and feedforward neural networks, suggesting that it is possible to have the best of both worlds: good model accuracy and interpretability.
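To make the idea of $\ell_0$-penalized additive modeling with block coordinate descent more concrete, here is a minimal sketch. It is not the authors' ELAAN implementation: the cubic polynomial basis, the function names `poly_basis` and `l0_block_cd`, the penalty value, and the synthetic data are all assumptions chosen for illustration, and interaction effects and active-set updates are omitted. Each covariate gets a block of basis columns, and a block is kept only if fitting it lowers the squared-error loss by more than the penalty `lam`.

```python
import numpy as np

def poly_basis(x, degree=3):
    """Centered polynomial basis for one covariate (an illustrative choice)."""
    B = np.column_stack([x ** d for d in range(1, degree + 1)])
    return B - B.mean(axis=0)

def l0_block_cd(blocks, y, lam, n_sweeps=20, ridge=1e-8):
    """Cyclic block coordinate descent for
       (1/2n) * ||y - sum_j B_j beta_j||^2 + lam * #{j : beta_j != 0}."""
    n = len(y)
    coefs = [np.zeros(B.shape[1]) for B in blocks]
    fitted = np.zeros(n)
    for _ in range(n_sweeps):
        changed = False
        for j, B in enumerate(blocks):
            # Residual with block j's current contribution removed.
            partial = y - (fitted - B @ coefs[j])
            # Least-squares fit of this block (tiny ridge for stability).
            G = B.T @ B + ridge * np.eye(B.shape[1])
            beta = np.linalg.solve(G, B.T @ partial)
            # Loss reduction from switching the block on.
            gain = (partial @ partial - np.sum((partial - B @ beta) ** 2)) / (2 * n)
            new = beta if gain > lam else np.zeros_like(beta)
            if not np.allclose(new, coefs[j]):
                changed = True
            fitted += B @ (new - coefs[j])
            coefs[j] = new
        if not changed:
            break
    return coefs

# Synthetic example (hypothetical data): only covariates 0 and 1 matter.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 6))
y = np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(500)
y = y - y.mean()

blocks = [poly_basis(X[:, j]) for j in range(X.shape[1])]
coefs = l0_block_cd(blocks, y, lam=0.01)
support = [j for j, c in enumerate(coefs) if np.any(c != 0)]
print("selected covariates:", support)
```

The hard keep-or-drop rule per block is what distinguishes an $\ell_0$ penalty from a shrinkage penalty such as the group Lasso: selected components are fitted without shrinkage, which tends to give sparser models at comparable accuracy.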

Paper Structure

This paper contains 47 sections, 5 theorems, 54 equations, 7 figures, 7 tables.

Key Result

Theorem 2.1

Let $\widehat{f}_n$ be defined as in (crit.add1). Then, there exists a universal constant $c_1$, such that if $\lambda_n\ge c_1\sigma[r_n^2+r_n\sqrt{\log (ep)/n}\,]$, then with probability at least $1-1/p$.
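To get a feel for the condition on $\lambda_n$ in the theorem, the lower bound $c_1\sigma[r_n^2+r_n\sqrt{\log(ep)/n}\,]$ can be evaluated numerically. The sketch below is only arithmetic on that expression: the helper name `lambda_lower_bound` is hypothetical, the universal constant $c_1$ is not given here so it defaults to 1 as a placeholder, and $r_n$ is passed in as a plain number because its definition is not reproduced in this summary.

```python
import math

def lambda_lower_bound(sigma, r_n, n, p, c1=1.0):
    """Evaluate c1 * sigma * (r_n^2 + r_n * sqrt(log(e*p) / n)).

    c1 is an unspecified universal constant; 1.0 is a placeholder.
    r_n is treated as a given rate, not derived here.
    """
    return c1 * sigma * (r_n ** 2 + r_n * math.sqrt(math.log(math.e * p) / n))

# Hypothetical inputs: the bound shrinks as n grows with everything else fixed.
for n in (1_000, 74_000):
    print(n, lambda_lower_bound(sigma=1.0, r_n=0.1, n=n, p=40))
```

Note that $\log(ep) = 1 + \log p$, so the dependence on the number of covariates is only logarithmic.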

Figures (7)

  • Figure 1: The 2013-2017 American Community Survey self-response rates for all tracts in the continental United States. The North, in general, and the Upper Midwest and the Northeast, in particular, have higher self-response rates than the rest of the country. Tracts with lower self-response rates are visible in many states -- in particular, in the South and in the Mountain region.
  • Figure 2: Panels [Left]-[Right] illustrate marginal nonparametric fits for the self-response rate output variable versus three covariates. Each marginal fit, displayed on a scatter plot with a solid blue line, clearly suggests a nonlinear relationship between the output and the individual covariate (note that the covariates are standardized). The $x$-axis corresponds to: [Left] "Persons of Hispanic origin in the ACS"; [Middle] "Number of households that have only a smartphone and no other computing device"; [Right] "Persons 25 years and over with college degree or higher in the ACS".
  • Figure 3: Sparsity pattern of the main and interaction effects presented in a $p \times p$ matrix: a black square on the diagonal indicates the presence of a main effect, and an off-diagonal black square indicates the presence of an interaction effect in the joint model. [Left] Panel illustrates the sparsity pattern of a Lasso model with main and interaction effects. There are 37 main and 555 interaction effects in the optimal model. [Right] Panel illustrates the sparsity pattern of a nonlinear AM with main and interaction effects, i.e., model \ref{eq: GAM with interactions L0 FunForm}. There are 8 main and 92 interaction effects in the optimal model. Model \ref{eq: GAM with interactions L0 FunForm} has prediction performance similar to the Lasso model, with only 3 main and 33 interaction effects. Nonlinear models lead to much more compact models and, hence, are easier to interpret than linear models with interactions. Both models were trained on a 2019 US Census Bureau Planning Database dataset (predicting the tract-level self-response rate) with $p = 40$ covariates and $74{,}000$ observations.
  • Figure 4: Predicted ACS self-response rates for all tracts in the United States.
  • Figure 5: ACS self-response rates for all tracts in the District of Columbia. a) Actual ACS self-response rates. b) Predicted ACS self-response rates for AMs with interactions \ref{eq: GAM with interactions L0}. c) Difference between the actual and predicted self-response rate: difference = actual - predicted.
  • ...and 2 more figures
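
The $p \times p$ sparsity-pattern matrix described in the Figure 3 caption can be sketched as a small data structure. The code below is an illustration, not taken from the paper's repository: the helper names `sparsity_pattern` and `obeys_strong_hierarchy` and the example index sets are hypothetical. It also checks strong hierarchy in the usual sense, where an interaction between covariates $j$ and $k$ is allowed only when both main effects $j$ and $k$ are present.

```python
import numpy as np

def sparsity_pattern(p, main_effects, interactions):
    """Boolean p x p matrix: diagonal = main effects, off-diagonal = interactions."""
    S = np.zeros((p, p), dtype=bool)
    for j in main_effects:
        S[j, j] = True
    for j, k in interactions:
        S[j, k] = S[k, j] = True  # interactions are unordered pairs
    return S

def obeys_strong_hierarchy(S):
    """True if every interaction (j, k) has both main effects j and k selected."""
    diag = np.diag(S)
    off_diagonal = S & ~np.eye(len(S), dtype=bool)
    allowed = np.outer(diag, diag)
    return bool(np.all(~off_diagonal | allowed))

# Hypothetical model with 5 covariates.
hierarchical = sparsity_pattern(5, main_effects=[0, 2], interactions=[(0, 2)])
violating = sparsity_pattern(5, main_effects=[0, 2], interactions=[(0, 2), (1, 3)])

print(obeys_strong_hierarchy(hierarchical))
print(obeys_strong_hierarchy(violating))
```

Counting the `True` entries on the diagonal and in the strict upper triangle gives the numbers of main and interaction effects quoted in the caption.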

Theorems & Definitions (5)

  • Theorem 2.1
  • Corollary 1
  • Lemma S2.1
  • Lemma S2.2
  • Lemma S2.3