A Bernstein polynomial approach for the estimation of cumulative distribution functions in the presence of missing data

Rihab Gharbi, Wissem Jedidi, Salah Khardani, Frédéric Ouimet

Abstract

We study nonparametric estimation of univariate cumulative distribution functions (CDFs) pertaining to data missing at random. The proposed estimators smooth the inverse probability weighted (IPW) empirical CDF with the Bernstein operator, yielding monotone, $[0,1]$-valued curves that automatically adapt to bounded supports. We analyze two versions: a pseudo estimator that uses known propensities and a feasible estimator that uses propensities estimated nonparametrically from discrete auxiliary variables, the latter scenario being much more common in practice. For both, we derive pointwise bias and variance expansions, establish the optimal polynomial degree $m$ with respect to the mean integrated squared error, and prove the asymptotic normality. A key finding is that the feasible estimator has a smaller variance than the pseudo estimator by an explicit nonnegative correction term. We also develop an efficient degree selection procedure via least-squares cross-validation. Monte Carlo experiments demonstrate that, for moderate to large sample sizes, the Bernstein-smoothed feasible estimator outperforms both its unsmoothed counterpart and an integrated version of the IPW kernel density estimator proposed by Dubnicka (2009) in the same context. A real-data application to fasting plasma glucose from the 2017-2018 NHANES survey illustrates the method in a practical setting. All code needed to reproduce our analyses is readily accessible on GitHub.
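The construction described in the abstract can be illustrated with a minimal sketch, assuming the standard Bernstein form $\sum_{k=0}^{m} F_n(k/m)\binom{m}{k} y^k (1-y)^{m-k}$ applied to an IPW empirical CDF $F_n$ for data rescaled to $[0,1]$. The function names `ipw_ecdf` and `bernstein_cdf`, the toy data, and the normalization of the weights by their sum (a Hájek-type choice) are assumptions made for illustration; they are not taken from the paper or its GitHub code, and the propensities are simply supplied as inputs rather than estimated.

```python
import math


def ipw_ecdf(y, ys, observed, propensities):
    """IPW empirical CDF at y.

    ys[i] may be None when observed[i] == 0 (missing response).
    Weights are normalized by their sum; this normalization is an
    assumption of the sketch, not necessarily the paper's definition.
    """
    weights = [d / p for d, p in zip(observed, propensities)]
    total = sum(weights)
    hit = sum(w for w, v, d in zip(weights, ys, observed) if d and v <= y)
    return hit / total


def bernstein_cdf(y, ys, observed, propensities, m):
    """Bernstein-smoothed IPW CDF of degree m at a point y in [0, 1]."""
    # Evaluate the IPW empirical CDF on the grid k/m, k = 0, ..., m.
    grid = [ipw_ecdf(k / m, ys, observed, propensities) for k in range(m + 1)]
    # Mix the grid values with Binomial(m, y) probabilities.
    return sum(
        g * math.comb(m, k) * y**k * (1 - y) ** (m - k)
        for k, g in enumerate(grid)
    )


# Toy data on [0, 1]: the second response is missing.
ys = [0.2, None, 0.5, 0.8]
observed = [1, 0, 1, 1]
propensities = [0.5, 0.5, 1.0, 1.0]

for point in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(point, bernstein_cdf(point, ys, observed, propensities, m=4))
```

Because the grid values are nondecreasing and lie in $[0,1]$, the binomial mixture inherits both properties, which is the monotone, $[0,1]$-valued behavior the abstract refers to. The degree `m=4` is arbitrary here; the paper selects it by least-squares cross-validation, which this sketch does not implement.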

Paper Structure

This paper contains 23 sections, 10 theorems, 105 equations, 1 figure, 3 tables.

Key Result

Proposition 1

Suppose that Assumption 3 holds. Then, uniformly for $y\in [0,1]$, we have, as $n\to \infty$, where

Figures (1)

  • Figure 1: Feasible CDF of fasting plasma glucose: unsmoothed IPW versus Bernstein-smoothed (LSCV-chosen degree $m^*$). The horizontal axis only shows $[0.05,0.40]$ for a clearer view.

Theorems & Definitions (16)

  • Proposition 1: Bias
  • Proposition 2: Variance
  • Corollary 3: Mean squared error
  • Corollary 4: Mean integrated squared error
  • Theorem 5: Asymptotic normality
  • Proposition 6: Bias
  • Proposition 7: Variance
  • Corollary 8: Mean squared error
  • Corollary 9: Mean integrated squared error
  • Theorem 10: Asymptotic normality
  • ...and 6 more