Doubly Robust Machine Learning for Population Size Estimation with Missing Covariates: Application to Gaza Conflict Mortality

Mateo Dulce Rubio; Edward H. Kennedy; Nicholas P. Jewell

TL;DR

The paper addresses population size estimation from capture-recapture data when covariates are missing, combining a Missing at Random (MAR) assumption with a no highest-order interaction (NHOI) identification strategy. It develops nonparametric, doubly robust one-step estimators based on the efficient influence function, which allow approximately valid inference when nuisance components are fit with flexible machine learning. Simulations show substantial gains over naive imputation and plug-in methods, with valid inference maintained even at high missingness rates. In the Gaza mortality analysis (October 7, 2023, to June 30, 2024), the approach yields a more conservative yet precise estimate than previous methods and indicates that the true death toll exceeds official statistics by about 26%. Overall, the work provides principled tools for population size estimation in conflict settings and other hard-to-reach populations, extending capture-recapture methodology to incomplete covariates while retaining approximately valid finite-sample inference.

Abstract

Population size estimation from capture-recapture data is central for studying hard-to-reach populations, incorporating auxiliary covariates to account for heterogeneous capture probabilities and recapture dependencies. However, missing attributes pose a critical methodological challenge due to reluctance to share sensitive information, data collection limitations, and imperfect record linkage. Existing approaches either ignore missingness or rely on a priori imputation, potentially introducing substantial bias. In this work, we develop a novel nonparametric estimation framework using a Missing at Random assumption to identify capture probabilities under missing covariates. Using semiparametric efficiency theory, we construct one-step estimators that combine efficiency, robustness, and finite-sample validity: they approximately achieve the nonparametric efficiency bound, accommodate flexible machine learning methods through a doubly robust structure, and provide approximately valid inference for any sample size. Simulations demonstrate substantial improvements over naive imputation approaches, with our doubly robust ML estimators maintaining valid inference even at high missingness rates where competing methods fail. We apply our methodology to re-estimate mortality in the Gaza Strip from October 7, 2023, to June 30, 2024, using three-list capture-recapture data with missing demographic information. Our approach yields more conservative yet precise estimates compared to previous methods, indicating the true death toll exceeds official statistics by approximately 26%. Our framework provides practitioners with principled tools for handling incomplete data in conflict settings and other applications with hard-to-reach populations.

Paper Structure (20 sections, 4 theorems, 35 equations, 2 figures, 2 tables)

Introduction
Related Work on Population Size Estimation
Classic Capture-Recapture Methods
Covariate-Adjusted Methods
Nonparametric Approaches
Dealing with Missing Data
Preliminaries
Problem Formulation
Identification of the Capture Probability
Notation
Nonparametric Estimation
Influence Function and Efficiency Bound
Estimation Strategies
Plug-in Estimator
One-step ML Estimator
...and 5 more sections

Key Result

Lemma 1

(Identification Result) Assume no highest-order interaction (NHOI) in a conditional log-linear model, with covariates that are Missing at Random (MAR). Under positivity conditions, the inverse capture probability $\psi^{-1}$ is identified by a functional of $q_y(v) = \mathbb{Q}(Y=y \mid V=v)$ and $\lambda_x(y,v) = \mathbb{Q}(X=x \mid Y=y, V=v, R=1)$, both of which are identifiable from the observed data $\mathcal{D}$.
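
To make the NHOI assumption concrete, the sketch below covers only the simplest special case: three lists, no covariates, and no missing data. It is not the paper's estimator, which additionally conditions on covariates, handles missingness under MAR, and applies a one-step correction. With three lists, setting the three-way log-linear interaction to zero gives $p_{000} = p_{111}\,p_{100}\,p_{010}\,p_{001} / (p_{110}\,p_{101}\,p_{011})$, so the odds of going unobserved equal the same ratio computed from the observed-pattern probabilities $q_y$. The function name `nhoi_three_list` and the string encoding of capture patterns are assumptions made for illustration.

```python
from math import prod

# Hypothetical helper, not from the paper: a plug-in NHOI estimate for
# three lists with no covariates and no missing data.
ODD_PATTERNS = ["100", "010", "001", "111"]   # captured by 1 or 3 lists
EVEN_PATTERNS = ["110", "101", "011"]         # captured by exactly 2 lists


def nhoi_three_list(counts):
    """Return (capture_probability, population_size_estimate).

    counts maps each observed capture pattern (e.g. "110" = on lists 1 and 2,
    not on list 3) to the number of individuals with that pattern.
    All two-list counts must be positive, otherwise the ratio is undefined.
    """
    patterns = ODD_PATTERNS + EVEN_PATTERNS
    n_observed = sum(counts[p] for p in patterns)

    # q_y: probability of pattern y among observed individuals.
    q = {p: counts[p] / n_observed for p in patterns}

    # Under NHOI this ratio equals p_000 / (1 - p_000),
    # the odds of being missed by all three lists.
    odds_unobserved = prod(q[p] for p in ODD_PATTERNS) / prod(
        q[p] for p in EVEN_PATTERNS
    )

    capture_prob = 1.0 / (1.0 + odds_unobserved)
    return capture_prob, n_observed / capture_prob


# Expected cell counts for 1000 people captured independently with
# probabilities 0.5, 0.4, 0.3 (independence is a special case of NHOI).
example_counts = {
    "111": 60, "110": 140, "101": 90, "011": 60,
    "100": 210, "010": 140, "001": 90,
}
capture_prob, n_hat = nhoi_three_list(example_counts)
print(round(capture_prob, 4), round(n_hat, 1))
```

In this toy input the 790 observed individuals imply a capture probability of 0.79, so the estimate recovers the population of 1000. With real data the cell counts are noisy, and sparse two-list cells make the ratio unstable.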

Figures (2)

Figure 1: Average absolute bias (left) and RMSE (right) across missingness rates. MAR-based estimators (orange lines) maintain consistently good performance, while naive imputation approaches (blue lines) show rapidly increasing bias and RMSE as missingness increases. The imputation-based one-step estimator (solid line) performs particularly poorly.
Figure 2: Nominal 95% confidence interval coverage rates. MAR-based methods (orange lines) maintain approximately valid inference across all missingness levels. Imputation-based approaches (blue lines) exhibit severe coverage reduction, with the one-step variant showing particularly poor performance. The dotted reference line indicates nominal 95% coverage.
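
The three metrics reported in these figures can be computed from repeated simulation runs as in the rough sketch below. It assumes each run returns a point estimate and a standard error, and that intervals are Wald-type (estimate ± 1.96 standard errors). The paper's exact definitions may differ, for example in how bias is averaged across settings. The function name `summarize_simulation` is hypothetical.

```python
from math import sqrt


def summarize_simulation(estimates, std_errors, truth, z=1.96):
    """Summarize repeated estimates of a known true value.

    Returns the absolute value of the mean error, the root mean squared
    error, and the share of Wald intervals (estimate +/- z * se) that
    contain the truth. These definitions are assumptions for illustration.
    """
    m = len(estimates)
    errors = [est - truth for est in estimates]

    abs_bias = abs(sum(errors) / m)
    rmse = sqrt(sum(err ** 2 for err in errors) / m)

    covered = sum(
        1
        for est, se in zip(estimates, std_errors)
        if est - z * se <= truth <= est + z * se
    )
    return {"abs_bias": abs_bias, "rmse": rmse, "coverage": covered / m}


# Toy input: three runs estimating a true value of 100.
summary = summarize_simulation([98, 102, 110], [2, 2, 2], truth=100)
print(summary)
```

Coverage well below the nominal 95% indicates that the intervals are too narrow or centered on a biased estimate, which is the pattern the captions describe for the imputation-based approaches.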

Theorems & Definitions (11)

Lemma 1
Remark 1
Remark 2
Lemma 2
Theorem 1
Remark 3
Theorem 2
Remark 4
Proof of \ref{lemma:if_psi}
Proof of \ref{thm:optimal}
...and 1 more
