Efficient Multiple-Robust Estimation for Nonresponse Data Under Informative Sampling

Kosuke Morikawa, Kenji Beppu, Wataru Aida

TL;DR

This paper addresses bias from both informative sampling and nonresponse in surveys by formulating a two-step monotone missing data framework with a target parameter $\theta$ defined by $E\{U_\theta(X,Y)\}=0$ and deriving the semiparametric efficiency bounds for settings with and without external data. It develops adaptive estimators, based on the method of moments and on empirical likelihood, that achieve these bounds, and it introduces multiple robustness via a two-step empirical likelihood method to mitigate misspecification of working models. The analysis shows that incorporating external summary statistics through data fusion further reduces variance, yielding efficient estimators under Settings 1–3, and demonstrates this with a numerical study and a real data application to NHANES/NHIS. The work provides a principled framework for efficient, robust estimation in the presence of informative sampling and nonresponse and offers practical guidance for leveraging external data sources in survey inference.

Abstract

Nonresponse after probability sampling is a universal challenge in survey sampling, often necessitating adjustments to mitigate sampling and selection bias simultaneously. This study explored the removal of bias and effective utilization of available information, not just in nonresponse but also in the scenario of data integration, where summary statistics from other data sources are accessible. We reformulate these settings within a two-step monotone missing data framework, where the first step of missingness arises from sampling and the second originates from nonresponse. Subsequently, we derive the semiparametric efficiency bound for the target parameter. We also propose adaptive estimators utilizing methods of moments and empirical likelihood approaches to attain the lower bound. The proposed estimator exhibits both efficiency and double robustness. However, attaining efficiency with an adaptive estimator requires the correct specification of certain working models. To reinforce robustness against the misspecification of working models, we extend the property of double robustness to multiple robustness by proposing a two-step empirical likelihood method that effectively leverages empirical weights. A numerical study is undertaken to investigate the finite-sample performance of the proposed methods. We further applied our methods to a dataset from the National Health and Nutrition Examination Survey data by efficiently incorporating summary statistics from the National Health Interview Survey data.

Paper Structure

This paper contains 18 sections, 5 theorems, 65 equations, 3 figures.

Key Result

Theorem 1

The efficient score function in Setting 1 is … where $C_\theta(x)$ is already defined in [morikawa] and $D_\theta(r, x, y, z, w) = rU_\theta(x, y)/\pi(x, z, w) + \{1 - r/\pi(x, z, w)\}g_\theta(x,z,w)$. The efficient score function in Setting 2 is given by (eff_M) with the same $D_\theta$ as above but a different $C_\theta(x)=C_\theta=E\{(W-1)U_\theta(X, Y)\}/$ …
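
To make the role of $D_\theta$ concrete, here is a minimal Python sketch of its double-robust structure for the simplest target, a population mean, where $U_\theta(x,y) = y - \theta$ and $g_\theta = m(x) - \theta$ with $m(x)$ a working model for $E(Y \mid x)$. It is not the paper's estimator. It drops the sampling-weight term $C_\theta$ and the arguments $z, w$, treats every unit as sampled, and plugs in known functions for $\pi$ and $m$ instead of fitted working models. The names `d_theta` and `solve_mean`, and the simulated data, are hypothetical and chosen only for illustration.

```python
import numpy as np

def d_theta(theta, r, y, pi, m):
    """D_theta for a mean: r*U/pi + (1 - r/pi)*g with U = y - theta, g = m - theta."""
    y_obs = np.where(r == 1, y, 0.0)  # y is only used for respondents
    return r * (y_obs - theta) / pi + (1.0 - r / pi) * (m - theta)

def solve_mean(r, y, pi, m):
    """Closed-form root of sum_i D_theta = 0 when U_theta = y - theta."""
    y_obs = np.where(r == 1, y, 0.0)
    return float(np.mean(r * y_obs / pi + (1.0 - r / pi) * m))

# Simulated data (an assumption for illustration): true mean of Y is 1.0,
# and response depends on x, so the respondent mean is biased.
rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
pi_true = 1.0 / (1.0 + np.exp(-(0.5 + x)))  # response probability
r = rng.binomial(1, pi_true)                # response indicator
m_true = 1.0 + 2.0 * x                      # E(Y | x)

theta_both = solve_mean(r, y, pi_true, m_true)              # both models correct
theta_wrong_pi = solve_mean(r, y, np.full(n, 0.6), m_true)  # wrong pi, correct m
theta_wrong_m = solve_mean(r, y, pi_true, np.zeros(n))      # correct pi, wrong m
theta_naive = float(y[r == 1].mean())                       # ignores nonresponse

print(theta_both, theta_wrong_pi, theta_wrong_m, theta_naive)
```

In this sketch the estimate stays close to the true mean when either $\pi$ or $m$ is correct, while the unadjusted respondent mean does not. With $m \equiv 0$ the formula reduces to an inverse-probability-weighted mean.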

Figures (3)

  • Figure 1: Directed acyclic graphs when it is appropriate to treat $W$ as a random variable. On the left is the perspective of the sampling weight designer, and on the right is the perspective of the data analyst. The dashed lines (crossed out in red) represent dependence relations excluded under the PMAR and RSCI assumptions, while solid arrows denote direct dependence.
  • Figure 2: Three settings considered in this study. The data highlighted in black represent observed data, and the entries labeled "mis" indicate unsampled units or nonresponse.
  • Figure 3: (Left) Results in Setting 1: $\mathrm{HT}_i$ (Horvitz-Thompson), $\mathrm{KH}_{ij}$ (Kim and Haziza), proposed method-of-moments double-robust estimators $\mathrm{MM}_{ij}$, and proposed empirical-likelihood-based multiple-robust estimators $\mathrm{EL}_{ij|kl}$, where $i,j,k,l \in \{0,1\}$. The indices $i$ and $j$, and $k$ and $l$ take on values of one if the two working models are correct and values of zero otherwise. (Right) Results in Settings 2 and 3: $\mathrm{EL}_{ij|kl}\,(i,j,k,l \in \{0,1\})$ in Setting 2 and $\mathrm{EL}_{ij|kl}^{(N_1)}\,(i,j,k,l \in \{0,1\})$ in Setting 3 with the sample size of the external source being $N_1=100$ or $10{,}000$.

Theorems & Definitions (5)

  • Theorem 1
  • Lemma 1
  • Theorem 2
  • Theorem 3
  • Theorem 4