Boosting prediction with data missing not at random

Yuan Bian, Grace Y. Yi, Wenqing He

TL;DR

The paper tackles boosting when responses are missing not at random (MNAR) by developing two loss-adjustment strategies, inverse propensity weighting (IPW) and a Buckley-James (BJ) type adjustment, within a semiparametric framework. It constructs consistent estimators for the missing-data components and implements a functional gradient descent boosting algorithm whose convergence and consistency are proven. The key results are the convergence bound $R(f^{(m+1)}) - R(f^*) \le \left(1 - \dfrac{1}{C^* c^*}\right)^m \left(R(f^{(0)}) - R(f^*)\right)$ and the sup-norm consistency $\lim_{n\to\infty} \|\hat{f}_n^{AL} - f^*\|_{\infty}=0$. In simulations and an analysis of KLIPS data, the methods show competitive finite-sample performance under both MAR and MNAR settings, with sensitivity checks that compare the IPW, BJ, and naive approaches. The work provides a practical approach to predictive modeling when the response is MNAR and highlights identifiability considerations as central to validity.

Abstract

Boosting has emerged as a useful machine learning technique over the past three decades, attracting increased attention. Most advancements in this area, however, have primarily focused on numerical implementation procedures, often lacking rigorous theoretical justifications. Moreover, these approaches are generally designed for datasets with fully observed data, and their validity can be compromised by the presence of missing observations. In this paper, we employ semiparametric estimation approaches to develop boosting prediction methods for data with missing responses. We explore two strategies for adjusting the loss functions to account for missingness effects. The proposed methods are implemented using a functional gradient descent algorithm, and their theoretical properties, including algorithm convergence and estimator consistency, are rigorously established. Numerical studies demonstrate that the proposed methods perform well in finite sample settings.
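
To make the idea of boosting on an adjusted loss concrete, the following is a minimal sketch of functional gradient descent on an IPW-weighted squared-error loss. It is not the authors' algorithm: the stump base learner, the step size, the simulated data, and the names `fit_stump` and `ipw_l2_boost` are all illustrative assumptions, and the propensity $\pi(X_i, Y_i)$ is treated as known here, whereas the paper estimates the missing-data components.

```python
import numpy as np

def fit_stump(x, resid, w, n_thresholds=20):
    """Weighted least-squares regression stump on a 1-D feature."""
    best = None
    for t in np.quantile(x, np.linspace(0.05, 0.95, n_thresholds)):
        left = x <= t
        wl, wr = w[left].sum(), w[~left].sum()
        if wl == 0 or wr == 0:
            continue
        cl = np.sum(w[left] * resid[left]) / wl      # weighted mean, left leaf
        cr = np.sum(w[~left] * resid[~left]) / wr    # weighted mean, right leaf
        sse = np.sum(w * (resid - np.where(left, cl, cr)) ** 2)
        if best is None or sse < best[0]:
            best = (sse, t, cl, cr)
    return best[1:]  # (threshold, left value, right value)

def ipw_l2_boost(x, y, r, pi, n_iter=100, step=0.1):
    """Boost on the loss (r / pi) * (y - f(x))^2 / 2; pi is ASSUMED known."""
    obs = r == 1                      # only observed responses enter the loss
    xo, yo = x[obs], y[obs]
    w = 1.0 / pi[obs]                 # inverse propensity weights
    f0 = np.sum(w * yo) / np.sum(w)   # weighted-mean initial fit
    fo = np.full(xo.shape, f0)
    stumps, risks = [], []
    for _ in range(n_iter):
        risks.append(np.sum(w * (yo - fo) ** 2) / (2 * len(x)))
        # For squared error the negative gradient is the residual y - f.
        t, cl, cr = fit_stump(xo, yo - fo, w)
        fo = fo + step * np.where(xo <= t, cl, cr)
        stumps.append((t, cl, cr))
    risks.append(np.sum(w * (yo - fo) ** 2) / (2 * len(x)))

    def predict(x_new):
        out = np.full(np.shape(x_new), f0)
        for t, cl, cr in stumps:
            out = out + step * np.where(x_new <= t, cl, cr)
        return out

    return predict, np.array(risks)

# Hypothetical MNAR data: the chance of observing y depends on y itself.
rng = np.random.default_rng(0)
n = 1000
x = rng.uniform(-2, 2, n)
y = np.sin(2 * x) + rng.normal(0, 0.3, n)
pi = 1 / (1 + np.exp(-(0.5 + y)))
r = rng.binomial(1, pi)
y_obs = np.where(r == 1, y, np.nan)   # missing responses are never used

predict, risks = ipw_l2_boost(x, y_obs, r, pi)
```

In this sketch each iteration fits the stump to the current residuals by weighted least squares with weights $R_i/\pi(X_i, Y_i)$, so the IPW-adjusted empirical risk stored in `risks` cannot increase from one iteration to the next for any step size in $(0, 1]$.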

Paper Structure

This paper contains 20 sections, 4 theorems, 40 equations, 8 figures, 1 algorithm.

Key Result

Proposition 1

The proposed adjusted loss functions, the IPW-adjusted loss and the BJ-type adjusted loss, have the same expectation as $L(Y_i,f(X_i))$, where the expectations are evaluated with respect to the joint distribution of the associated random variables.
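
The reasoning behind this kind of result can be sketched with a short worked derivation. The exact forms of the two adjusted losses are not reproduced in this summary, so the forms below are assumptions chosen for illustration: $R_i$ denotes the indicator that $Y_i$ is observed, $\pi(X_i,Y_i) = P(R_i=1 \mid X_i,Y_i) > 0$ is taken as known, and $L_i$ abbreviates $L(Y_i,f(X_i))$.

For an IPW-type loss of the assumed form $R_i L_i / \pi(X_i,Y_i)$, conditioning on $(X_i,Y_i)$ gives

$$
E\left\{\frac{R_i}{\pi(X_i,Y_i)}\, L_i\right\}
= E\left\{\frac{E(R_i \mid X_i,Y_i)}{\pi(X_i,Y_i)}\, L_i\right\}
= E(L_i).
$$

For a BJ-type loss of the assumed form $R_i L_i + (1-R_i)\, E(L_i \mid X_i, R_i=0)$, conditioning on $(X_i,R_i)$ gives

$$
E\left\{(1-R_i)\, E(L_i \mid X_i, R_i=0)\right\}
= E\left\{E\left((1-R_i) L_i \mid X_i, R_i\right)\right\}
= E\left\{(1-R_i) L_i\right\},
$$

so that

$$
E\left\{R_i L_i + (1-R_i)\, E(L_i \mid X_i, R_i=0)\right\}
= E(R_i L_i) + E\left\{(1-R_i) L_i\right\}
= E(L_i).
$$

Under these assumed forms, both adjustments recover the full-data risk in expectation, which is the property that lets a boosting procedure target $E\{L(Y_i,f(X_i))\}$ using only the observed responses.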

Figures (8)

  • Figure 1: Prediction assessments in the MAR scenario. Top and bottom rows correspond to Settings 1 and 2, respectively; left and right columns correspond to the values for S-MAE and S-RMSE, respectively.
  • Figure 2: Prediction assessments in the MNAR scenario. Top and bottom rows correspond to Settings 1 and 2, respectively; left and right columns correspond to the values for S-MAE and S-RMSE, respectively.
  • Figure 3: Boxplots of ${n_2}^{-1}\sum^{n_2}_{i=1}\hat{f}^*_{n_1}(X_i)$ obtained from the five methods in combination with the three loss functions, where the dashed line indicates the value of $E(Y_i)$. Top and bottom rows correspond to Settings 1 and 2, respectively; left and right columns correspond to the MAR and MNAR settings, respectively.
  • Figure 4: Performance under model misspecification in the MAR scenario. Top and bottom rows correspond to Settings 1 and 2, respectively; left and right columns correspond to the values for S-MAE and S-RMSE, respectively.
  • Figure 5: Performance under model misspecification in the MNAR scenario. Top and bottom rows correspond to Settings 1 and 2, respectively; left and right columns correspond to the values for S-MAE and S-RMSE, respectively.
  • ...and 3 more figures

Theorems & Definitions (4)

  • Proposition 1
  • Proposition 2
  • Theorem 1
  • Theorem 2