Amortized Variational Inference for Logistic Regression with Missing Covariates

M. Cherifi, Aude Sportisse, Xujia Zhu, Mohammed Nabil El Korso, A. Mesloub

Abstract

Missing covariate data pose a significant challenge to statistical inference and machine learning, particularly for classification tasks like logistic regression. Classical iterative approaches (EM, multiple imputation) are often computationally intensive, sensitive to high missingness rates, and limited in uncertainty propagation. Recent deep generative models based on VAEs show promise but rely on complex latent representations. We propose Amortized Variational Inference for Logistic Regression (AV-LR), a unified end-to-end framework for binary logistic regression with missing covariates. AV-LR integrates a probabilistic generative model with a simple amortized inference network, trained jointly by maximizing the evidence lower bound. Unlike competing methods, AV-LR performs inference directly in the space of missing data without additional latent variables, using a single inference network and a linear layer that jointly estimate regression parameters and the missingness mechanism. AV-LR achieves estimation accuracy comparable to or better than state-of-the-art EM-like algorithms, with significantly lower computational cost. It naturally extends to missing-not-at-random settings by explicitly modeling the missingness mechanism. Empirical results on synthetic and real-world datasets confirm its effectiveness and efficiency across various missing-data scenarios.
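To make the idea of amortized inference "directly in the space of missing data" concrete, here is a minimal NumPy sketch of a Monte Carlo ELBO for logistic regression with missing covariates. It is an illustration under stated assumptions, not the authors' implementation: the standard normal prior on covariates, the linear `encoder`, and all function and parameter names (`elbo`, `W`, `b`, `beta0`) are hypothetical choices made for this example, and only the ignorable (MCAR/MAR) case is covered, so no missingness model appears.

```python
import numpy as np

def encoder(x_filled, mask, y, W, b):
    """Hypothetical linear inference network (an assumption of this sketch).
    Maps (zero-filled covariates, mask, label) to a mean and a log-variance
    for every covariate; only the missing coordinates are used later."""
    h = np.concatenate([x_filled * mask, mask, y[:, None]], axis=1)  # (n, 2p+1)
    out = h @ W + b                                                  # (n, 2p)
    p = x_filled.shape[1]
    return out[:, :p], out[:, p:]

def elbo(x, mask, y, beta, beta0, W, b, rng, n_samples=8):
    """Monte Carlo estimate of the ELBO, averaged over rows.
    mask[i, j] = 1 if x[i, j] is observed and 0 if it is missing.
    Assumes an independent N(0, 1) prior on each covariate (illustrative)."""
    x_filled = np.where(mask == 1, x, 0.0)   # also replaces NaNs at missing entries
    mu, logvar = encoder(x_filled, mask, y, W, b)
    miss = 1.0 - mask
    log2pi = np.log(2.0 * np.pi)
    total = 0.0
    for _ in range(n_samples):
        eps = rng.standard_normal(x_filled.shape)
        x_miss = mu + np.exp(0.5 * logvar) * eps          # reparameterized draw
        x_full = mask * x_filled + miss * x_miss          # observed + imputed
        logits = x_full @ beta + beta0
        # log p(y | x): Bernoulli log-likelihood in a numerically stable form
        log_lik = y * logits - np.logaddexp(0.0, logits)
        # log p(x_obs, x_miss) under the assumed standard normal prior
        log_px = -0.5 * (x_full ** 2 + log2pi).sum(axis=1)
        # log q(x_miss | x_obs, y), evaluated on missing coordinates only
        log_q = (-0.5 * (eps ** 2 + logvar + log2pi) * miss).sum(axis=1)
        total += (log_lik + log_px - log_q).mean()
    return total / n_samples

# Small usage example on synthetic data with roughly 30% of entries missing.
rng = np.random.default_rng(0)
n, p = 200, 5
x = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(x @ beta_true)))).astype(float)
mask = (rng.random((n, p)) > 0.3).astype(float)
x_nan = np.where(mask == 1, x, np.nan)

W = 0.1 * rng.standard_normal((2 * p + 1, 2 * p))  # untrained encoder weights
b = np.zeros(2 * p)
print(elbo(x_nan, mask, y, beta_true, 0.0, W, b, rng))
```

In a full training loop, the regression parameters and the encoder weights would be updated jointly by gradient ascent on this quantity with an automatic differentiation library; the sketch only evaluates the objective.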

Paper Structure (22 sections, 33 equations, 7 figures, 14 tables)

Figures (7)

  • Figure 1: Structural causal graph of AV-LR for MCAR or MAR covariates. Nodes in grey represent fully observed variables, nodes in white denote unobserved variables, and mixed nodes indicate the presence of both cases. An edge from $x$ to $y$ means that $x$ causes $y$.
  • Figure 2: Structural causal graph of AV-LR for MNAR covariates. Nodes in grey represent fully observed variables, nodes in white denote unobserved variables, and mixed nodes indicate the presence of both cases. An edge from $x$ to $y$ means that $x$ causes $y$.
  • Figure 3: Evolution of classification metrics (accuracy, AUC, and Brier score) over epochs. The solid line corresponds to AV-LR (non-ignorable extension) and the dashed line to AV-LR (ignorable). Colors denote missingness mechanisms: blue for 50% MCAR, orange for 60% MAR, and green for 60% MNAR. Synthetic data were generated with $(n,p)=(5000,5)$ for training and $(1000,5)$ for testing.
  • Figure 4: Comparative performance of imputation methods on the Bank Note Authentication Dataset under 50% MNAR across Self-Masking, Logistic, and Sequential Logistic mechanisms.
  • Figure 5: Comparative performance of imputation methods on the Pima Indians Diabetes Database under 50% MNAR across Self-Masking, Logistic, and Sequential Logistic mechanisms.
  • ...and 2 more figures