Survey Data Integration for Distribution Function Estimation

Jeremy Flood, Sayed Mostafa

TL;DR

This work addresses estimating finite-population distribution functions and quantiles, $F_N(t)$ and $T_N(\alpha)$, when outcomes are observed only in a nonprobability sample and covariates are available in both a probability sample and the nonprobability sample. It introduces a residual-based CDF estimator $\widehat{F}_\text{R}(t; \boldsymbol{\widehat{\beta}})$ that combines design-based weights with a model-derived residual distribution learned from $\mathcal{B}$, along with a corresponding quantile estimator $\widehat{T}_\text{R}(\alpha)$, and develops linearization and bootstrap variance estimators. The paper provides a rigorous asymptotic theory (including Lemma 1 and Theorems 1–4), compares to naïve and plug-in alternatives, and validates the approach through extensive simulations under MAR and MNAR, as well as a NHANES real-data example. Findings show substantial efficiency gains of the residual-based approach over alternatives, particularly when the nonprobability sample is large and selection is MAR, with noted limitations under nonignorable selection and model misspecification. The work offers a principled, model-assisted path for distributional inference in data integration, with practical implications for policy, public health, and resource allocation.
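For reference, the two targets named above have standard finite-population definitions. The display below is a sketch under assumed notation: the population index set $U$ of size $N$ and the indicator $I(\cdot)$ are not defined in this summary and are introduced here only for illustration.

$$
F_N(t) = \frac{1}{N} \sum_{i \in U} I(y_i \le t), \qquad
T_N(\alpha) = \inf\{\, t : F_N(t) \ge \alpha \,\}, \quad 0 < \alpha < 1.
$$

So $F_N(t)$ is the proportion of population units with outcome at most $t$ (for example, the share of individuals with income below a poverty threshold), and $T_N(\alpha)$ is the smallest value at which that proportion reaches $\alpha$.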

Abstract

Estimates of finite population cumulative distribution functions (CDFs) and quantiles are critical for policy-making, resource allocation, and public health planning. For instance, federal finance agencies may require accurate estimates of the proportion of individuals with income below the federal poverty line to determine funding eligibility, while health organizations may rely on precise quantile estimates of key health variables to guide local health interventions. Despite growing interest in survey data integration, research on the integration of probability and nonprobability samples to estimate CDFs and quantiles remains limited. In this study, we propose a novel residual-based CDF estimator that integrates information from a probability sample with data from potentially large nonprobability samples. Our approach leverages shared covariates observed in both datasets, while the response variable is available only in the nonprobability sample. Using a semiparametric approach, we train an outcome model on the nonprobability sample and incorporate model residuals with sampling weights from the probability sample to estimate the CDF of the target variable. Based on this CDF estimator, we define a quantile estimator and introduce linearization and bootstrap methods for variance estimation of both the CDF and quantile estimators. Under certain regularity conditions, we establish the asymptotic properties, including bias and variance, of the CDF estimator. Our empirical findings support the theoretical results and demonstrate the favorable performance of the proposed estimators relative to plug-in mass imputation estimators and the naïve estimators derived from the nonprobability sample only. A real data example is presented to illustrate the proposed estimators.
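The abstract's recipe (fit an outcome model on the nonprobability sample, then combine its residuals with the probability sample's weights) can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the paper's exact estimator: a linear working model fitted by ordinary least squares, homoscedastic errors, and the Chambers-Dunstan-style form $\widehat{F}(t) = \widehat{N}^{-1} \sum_{i \in \mathcal{A}} d_i \, \widehat{G}(t - \boldsymbol{x}_i^\top \boldsymbol{\widehat{\beta}})$, where $\widehat{G}$ is the empirical CDF of the residuals from $\mathcal{B}$, $d_i$ are the design weights in the probability sample $\mathcal{A}$, and $\widehat{N} = \sum_{i \in \mathcal{A}} d_i$. The function names `residual_cdf` and `residual_quantile` and the simulated data are hypothetical.

```python
import numpy as np

def residual_cdf(t, x_a, d_a, x_b, y_b):
    """Illustrative residual-based CDF estimate at the points in t.

    x_a, d_a: covariates and design weights from the probability sample A.
    x_b, y_b: covariates and outcomes from the nonprobability sample B.
    """
    t = np.atleast_1d(np.asarray(t, dtype=float))
    # Fit the (assumed) linear working model on B by ordinary least squares.
    xb1 = np.column_stack([np.ones(len(x_b)), x_b])
    beta, *_ = np.linalg.lstsq(xb1, y_b, rcond=None)
    resid = np.sort(y_b - xb1 @ beta)
    # Predicted means for the units in A, where y is not observed.
    mu_a = np.column_stack([np.ones(len(x_a)), x_a]) @ beta
    # Empirical residual CDF G(u) = #{resid <= u} / n_B, evaluated at t - mu_i.
    u = t[:, None] - mu_a[None, :]
    g = np.searchsorted(resid, u, side="right") / resid.size
    # Design-weighted average over A.
    return g @ d_a / d_a.sum()

def residual_quantile(alpha, grid, x_a, d_a, x_b, y_b):
    """Smallest grid point whose estimated CDF is at least alpha."""
    f = residual_cdf(grid, x_a, d_a, x_b, y_b)
    idx = np.searchsorted(f, alpha, side="left")
    return grid[min(idx, len(grid) - 1)]

# Simulated illustration (all settings are made up for this sketch).
rng = np.random.default_rng(0)
N = 20_000
x = rng.normal(size=N)
y = 1.0 + 2.0 * x + rng.normal(size=N)

# A: simple random sample with constant weights N / n_A; y is not used from A.
n_a = 1_000
idx_a = rng.choice(N, size=n_a, replace=False)
x_a, d_a = x[idx_a], np.full(n_a, N / n_a)

# B: selection depends on x only, so selection is ignorable given x.
in_b = rng.random(N) < 1.0 / (1.0 + np.exp(-(x - 1.0)))
x_b, y_b = x[in_b], y[in_b]

grid = np.linspace(y.min(), y.max(), 200)
f_hat = residual_cdf(grid, x_a, d_a, x_b, y_b)
f_true = np.array([(y <= t).mean() for t in grid])   # population CDF F_N
f_naive = np.array([(y_b <= t).mean() for t in grid])  # B-only estimate
median_hat = residual_quantile(0.5, grid, x_a, d_a, x_b, y_b)
```

Comparing `f_hat` and `f_naive` against `f_true` shows the intended contrast: the B-only estimate inherits the selection on `x`, while the residual-based estimate is evaluated over the design-weighted covariate distribution of the probability sample. Under a misspecified working model or selection that depends on `y`, this sketch would not be expected to behave as well.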

Paper Structure

This paper contains 18 sections, 5 theorems, 43 equations, 6 figures, 5 tables.

Key Result

Lemma 1

Under Assumptions a1--a6, and

Figures (6)

  • Figure 1: RMSER for the naïve, plug-in, and residual-based CDF and quantile estimators when $n_\text{B} = n_\text{A}$.
  • Figure 2: RMSER for the naïve, plug-in, and residual-based CDF and quantile estimators when $n_\text{B} = 10n_\text{A}$.
  • Figure 3: RMSER for the naïve, plug-in, and residual-based CDF and quantile estimators when $n_\text{B} = 20n_\text{A}$.
  • Figure 4: Performance statistics for the variance estimators of $\widehat{F}_\text{R}(t; \boldsymbol{\widehat{\beta}})$ under models $\xi_{1}$ and $\xi_{3}$, with $n_\text{A} = 1{,}000$.
  • Figure 5: Performance statistics for the variance estimators of $\widehat{T}_\text{R}(\alpha)$ under models $\xi_{1}$ and $\xi_{3}$, with $n_\text{A} = 1{,}000$.
  • ...and 1 more figure

Theorems & Definitions (8)

  • Definition 1: Positivity Condition
  • Definition 2: Transportability Condition
  • Definition 3: Ignorability Condition
  • Lemma 1
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4