
On Data Thinning for Model Validation in Small Area Estimation

Sho Kawano, Paul A. Parker, Zehang Richard Li

Abstract

Small area estimation (SAE) produces estimates of population parameters for geographic and demographic subgroups with limited sample sizes. Such estimates are critical for informing policy decisions, ranging from poverty mapping to social program funding. Despite its widespread use, principled validation of SAE models remains challenging and general guidelines are far from well-established. Unlike conventional predictive modeling settings, validation data are rarely available in the SAE context. External validation surveys or censuses often do not exist, and access to individual-level microdata is often restricted, making standard cross-validation infeasible. In this paper, we propose a novel model validation scheme using only area-level direct survey estimates under the widely used Fay--Herriot model. Our approach is based on data thinning, which splits area-level observations into independent training and test components to enable out-of-sample validation. Our theoretical analysis reveals a fundamental tension inherent in thinning-based validation: performance metrics measured on the thinned training component target a different quantity than those based on the full data, with the gap varying by model complexity. Increasing the information allocated for training reduces this gap but inflates the variance of the estimator. We formally characterize this bias-variance tradeoff and provide practical recommendations for thinning parameters that balance these competing considerations in model comparison. We show that data thinning with these settings provides consistent and stable performance across heterogeneous sampling designs in design-based simulations using American Community Survey microdata.
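The splitting step described above can be sketched with Gaussian data thinning. This is a minimal illustration, not the paper's implementation: it assumes area-level direct estimates $y_i \sim N(\theta_i, D_i)$ with known sampling variances $D_i$, and the helper name `thin_gaussian` and the training fraction argument `eps` are hypothetical.

```python
import numpy as np

def thin_gaussian(y, D, eps, rng):
    """Split y_i ~ N(theta_i, D_i) into independent components (sketch).

    Draw w_i ~ N(0, eps*(1-eps)*D_i) and set
        y1_i = eps * y_i + w_i   (training component),
        y2_i = y_i - y1_i        (test component).
    Then y1_i ~ N(eps*theta_i, eps*D_i), y2_i ~ N((1-eps)*theta_i,
    (1-eps)*D_i), and y1_i, y2_i are independent, since
    Cov(y1_i, y2_i) = eps*(1-eps)*D_i - eps*(1-eps)*D_i = 0.
    """
    w = rng.normal(0.0, np.sqrt(eps * (1.0 - eps) * D))
    y1 = eps * y + w
    y2 = y - y1
    return y1, y2
```

A model can then be trained on `y1 / eps` (unbiased for $\theta_i$, with variance $D_i/\epsilon$) and evaluated against the independent `y2 / (1 - eps)`, which is the sense in which thinning enables out-of-sample validation from a single area-level dataset.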

Paper Structure

This paper contains 44 sections, 9 theorems, 101 equations, 8 figures, 2 tables, 4 algorithms.

Key Result

Theorem 3.2

The estimator $\widehat{\mathrm{MSE}}_{\epsilon}$ is unbiased for the thinned-data oracle MSE, where the expectation defining the bias is taken over the joint distribution of $(y^{(1)}, y^{(2)})$, unconditional on $y$. $\blacktriangleleft$
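The expectation in Theorem 3.2 is over repeated thinnings $(y^{(1)}, y^{(2)})$. The loop below is a hedged sketch of a thinning-based validation scheme in that spirit; it is a naive out-of-sample squared-error average, not the paper's debiased estimator $\widehat{\mathrm{MSE}}_{\epsilon}$, and the names `thinned_validation_mse`, `predict`, and `R` are illustrative assumptions.

```python
import numpy as np

def thinned_validation_mse(y, D, eps, predict, rng, R=5):
    """Average out-of-sample squared error over R independent thinnings.

    `predict(y1, D1)` should return estimates of eps*theta_i from the
    training component y1 (which has variance D1 = eps*D). Predictions
    are rescaled to the theta scale and scored against the independent
    test component y2 / (1 - eps). Naive sketch, not a debiased estimator.
    """
    errs = []
    for _ in range(R):
        # Gaussian thinning: y1 and y2 are independent given theta.
        w = rng.normal(0.0, np.sqrt(eps * (1.0 - eps) * D))
        y1 = eps * y + w
        y2 = y - y1
        theta_hat = predict(y1, eps * D) / eps
        errs.append(np.mean((y2 / (1.0 - eps) - theta_hat) ** 2))
    return float(np.mean(errs))
```

Because `y2 / (1 - eps)` is a noisy proxy for $\theta_i$ (variance $D_i/(1-\epsilon)$), this raw average overstates the true prediction error by a known amount, which is one reason a correction such as the paper's $\widehat{\mathrm{MSE}}_{\epsilon}$ is needed.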

Figures (8)

  • Figure 1: Spatial covariate effects for the Fay--Herriot model for example data created using PUMS for California. Using $p = 6$ basis functions results in much more spatial smoothing. The model with $p = 42$ shows much finer local variation, particularly in the north and the southern regions of the state including Greater Los Angeles, shown in the zoomed-in rectangle. We use this as our empirical model validation example in subsequent sections.
  • Figure 2: Average realized thinning gap for Fay--Herriot models with $p = 6, 18, 30, 42$ spatial basis functions, averaged over 50 independent samples. Each panel corresponds to an equal allocation design with the indicated target $n$. Complex models (higher $p$) exhibit larger gaps, particularly at low $\epsilon$.
  • Figure 3: Variance of the MSE estimator for Fay--Herriot models with $p = 6, 18, 30, 42$ spatial basis functions, computed across 50 independent samples. Each panel corresponds to an equal allocation survey design with the indicated sample size per area. The variance is minimized at $\epsilon \approx 0.3$--$0.4$, with notable increases for $\epsilon \geq 0.8$.
  • Figure 4: The thinning gap-variance trade-off for Fay--Herriot models with $p = 6, 18, 30, 42$ spatial basis functions. Curves show the sum of the squared thinning gap and the variance of the MSE estimator, averaged across 50 samples from each design. The curves are relatively flat for $\epsilon$ between $0.4$ and $0.7$ across different designs. A log-scale version of the same plot, shown in Appendix \ref{app:log_tradeoff}, makes the model-specific optima easier to see and shows that the between-model differences in the curves shrink for $\epsilon > 0.5$.
  • Figure 5: Effect of the training fraction $\epsilon$ and the number of repeats $R \in \{1, 3, 5\}$ on basis selection under equal-allocation designs with target sample sizes $n$. Shaded ribbons indicate $\pm 1$ standard errors of the mean, taken over 50 simulated datasets. Panel (a): RMSE from the average oracle basis count. Panel (b): Mean bias; negative values indicate under-selection.
  • ...and 3 more figures

Theorems & Definitions (20)

  • Theorem 3.2: Unbiased MSE estimation
  • Proof
  • Proposition 3.3
  • Proof
  • Remark 3.4
  • Proposition 3.5
  • Corollary 3.6
  • Proposition 3.7: Variance of the MSE estimator
  • Proposition 3.8: Variance-minimizing $\epsilon$ for the Fay--Herriot model with known parameters
  • Proof
  • ...and 10 more