High-dimensional logistic regression with missing data: Imputation, regularization, and universality

Kabir Aladin Verchand, Andrea Montanari

TL;DR

This work analyzes high-dimensional ridge-regularized logistic regression in settings where covariates are missing or corrupted by Gaussian noise. By leveraging Gaussian error-in-variables ensembles and universality principles, it derives exact asymptotic characterizations of prediction and estimation errors and proves that these characterizations hold under broad independence and moment conditions. The authors connect the theory to imputation strategies, showing that single imputation with ridge regularization can nearly match Bayes-optimal prediction, while prior imputation may underperform; these insights are corroborated by extensive simulations and non-asymptotic concentration control. The results provide a unified framework to understand the behavior of imputation-based methods in high dimensions and offer practical guidance for model selection and data preprocessing in missing-data scenarios.

Abstract

We study high-dimensional, ridge-regularized logistic regression in a setting in which the covariates may be missing or corrupted by additive noise. When both the covariates and the additive corruptions are independent and normally distributed, we provide exact characterizations of both the prediction error and the estimation error. Moreover, we show that these characterizations are universal: as long as the entries of the data matrix satisfy a set of independence and moment conditions, our guarantees continue to hold. Universality, in turn, enables the detailed study of several imputation-based strategies when the covariates are missing completely at random. We ground our study by comparing the performance of these strategies with the conjectured performance -- stemming from replica theory in statistical physics -- of the Bayes optimal procedure. Our analysis yields several insights, including: (i) a distinction between single imputation and a simple variant of multiple imputation and (ii) that adding a simple ridge regularization term to single-imputed logistic regression can yield an estimator whose prediction error is nearly indistinguishable from the Bayes optimal prediction error. We supplement our findings with extensive numerical experiments.
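
The setting described in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' code: it assumes standard Gaussian covariates, a signal $\boldsymbol{\theta}_0$ of norm $R$, entries observed independently with probability $\alpha$ (missing completely at random), and single imputation that fills each missing entry with the covariate mean (zero here). The function names, the untuned penalty `lam = 0.1`, and the plain gradient-descent solver are all hypothetical choices made for illustration; the paper's normalization and its optimally tuned regularization may differ.

```python
import numpy as np

def sigmoid(z):
    # Numerically stable logistic function.
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def simulate(n, theta0, alpha, rng):
    """Draw Gaussian covariates, logistic labels, and an MCAR observation mask."""
    p = theta0.shape[0]
    X = rng.standard_normal((n, p))                 # assumed N(0, I_p) covariates
    y = (rng.random(n) < sigmoid(X @ theta0)).astype(float)
    mask = rng.random((n, p)) < alpha               # True where an entry is observed
    return X, y, mask

def single_impute(X, mask):
    """Replace each missing entry by the covariate mean, which is 0 in this sketch."""
    return np.where(mask, X, 0.0)

def fit_ridge_logistic(X, y, lam, n_iter=500):
    """Minimize mean logistic loss + (lam / 2) * ||theta||^2 by gradient descent."""
    n, p = X.shape
    # Smoothness constant of the objective; 1 / smooth is a safe step size.
    smooth = np.linalg.norm(X, 2) ** 2 / (4.0 * n) + lam
    theta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ theta) - y) / n + lam * theta
        theta -= grad / smooth
    return theta

rng = np.random.default_rng(0)
n, p, alpha, radius, lam = 600, 200, 0.7, 4.0, 0.1   # illustrative values only

theta0 = rng.standard_normal(p)
theta0 *= radius / np.linalg.norm(theta0)            # enforce ||theta0|| = R

X, y, mask = simulate(n, theta0, alpha, rng)
theta_hat = fit_ridge_logistic(single_impute(X, mask), y, lam)

# Test points are corrupted by the same missingness mechanism and imputed identically.
X_te, y_te, mask_te = simulate(2000, theta0, alpha, rng)
pred = (single_impute(X_te, mask_te) @ theta_hat > 0).astype(float)
test_error = float(np.mean(pred != y_te))
print(f"test misclassification error: {test_error:.3f}")
```

Sweeping `lam` over a grid and repeating over independent trials would give an empirical counterpart to the regularized single-imputation curves described in the figures, under the assumptions stated above.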

Paper Structure

This paper contains 78 sections, 20 theorems, 257 equations, and 7 figures.

Key Result

Lemma 1

Under Assumption \ref{asm:regularity}, the asymptotic loss $L$ \eqref{eq:asymploss} satisfies the following properties.

Figures (7)

  • Figure 1: A comparison of several different imputation methods in low dimensions ($p = 2$). Triangular marks denote the average over $1000$ independent trials and the shaded regions represent the inter-quartile range. In contrast with the linear model, in which single imputation yields a consistent estimator \cite{chandrasekher2020imputation}, in the logistic model, single imputation is only able to identify the subspace in which $\boldsymbol{\theta}_0$ lies.
  • Figure 2: A comparison of single imputation and prior imputation in high dimensions ($p=500, n=1500$). Both are compared with the conjectured Bayes optimal error (see Section \ref{sec:numerical-illustration}). Triangular marks denote the average over $1000$ independent trials and shaded regions represent the inter-quartile range.
  • Figure 3: A comparison of the test error of optimally ridge-regularized logistic regression with the (conjectured) Bayes optimal test error. The probability of observing an entry is set as $\alpha=0.704$ and the contour plots are generated by numerically evaluating the asymptotic expressions for several values of the parameters $R$ (the radius of the problem) and $\delta$ (the ratio of samples to dimension).
  • Figure 4: A comparison of the Bayes error with the optimally regularized single imputation error. The probability of observing an entry is fixed as $\alpha=0.7$ and the radius of the problem is fixed as $R = 4$, whereas the ratio of samples to dimensions $\delta$ is varied. Triangular marks denote the empirical average of the empirical error, dashed maroon lines (barely visible) denote the exact single imputation error, and dashed black lines denote the Bayes error. The shaded region denotes the inter-quartile range.
  • Figure 5: A comparison of the regularized single imputation and prior imputation errors. The probability of observing an entry is fixed as $\alpha=0.7$, the radius of the problem is fixed as $R=1$, and the ratio of samples to dimensions is fixed as $\delta = 10$. Triangular marks denote averages of the empirical error and solid circles denote exact expressions evaluated via Theorem \ref{thm:main}. The shaded regions correspond to inter-quartile ranges.
  • ...and 2 more figures

Theorems & Definitions (25)

  • Definition 1: Gaussian error-in-variables
  • Definition 2: Asymptotic loss
  • Lemma 1
  • Proposition 1
  • Definition 3: $(\alpha_c, \alpha_2)$--universality class
  • Theorem 1
  • Corollary 1
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • ...and 15 more