
Identifiable Deep Latent Variable Models for MNAR Data

Huiming Xie, Fei Xue, Xiao Wang

Abstract

Missing data is a ubiquitous challenge in data analysis, often leading to biased and inaccurate results. Traditional imputation methods usually assume that the missingness mechanism is missing-at-random (MAR), where the missingness is independent of the missing values themselves. This assumption is frequently violated in real-world scenarios, which has prompted recent advances in deep-learning-based imputation methods to address this challenge. However, these methods neglect the crucial issue of nonparametric identifiability in missing-not-at-random (MNAR) data, which can lead to biased and unreliable results. This paper seeks to bridge this gap by proposing a novel framework based on deep latent variable models for MNAR data. Building on the assumption of conditional no self-censoring given latent variables, we establish the identifiability of the data distribution. This crucial theoretical result guarantees the feasibility of our approach. To effectively estimate unknown parameters, we develop an efficient algorithm utilizing importance-weighted autoencoders. We demonstrate, both theoretically and empirically, that our estimation process accurately recovers the ground-truth joint distribution under specific regularity conditions. Extensive simulation studies and real-world data experiments showcase the advantages of our proposed method compared to various classical and state-of-the-art approaches to missing data imputation.
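The abstract mentions estimation via importance-weighted autoencoders (IWAE). To make the objective concrete, here is a minimal numpy sketch of the standard importance-weighted bound $\mathcal{L}_K = \mathbb{E}\,\log \frac{1}{K}\sum_{k=1}^K p(x, z_k)/q(z_k \mid x)$, not the paper's full algorithm; the toy linear-Gaussian decoder and fixed encoder parameters are purely illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_gauss(x, mean, var):
    # log density of a diagonal Gaussian, summed over the last axis
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var).sum(-1)

def iwae_bound(x, enc_mean, enc_var, dec, K=50):
    """Monte Carlo estimate of L_K = log (1/K) sum_k p(x, z_k) / q(z_k | x),
    with z_k drawn from the encoder q(z | x)."""
    d = enc_mean.shape[-1]
    z = enc_mean + np.sqrt(enc_var) * rng.standard_normal((K, d))
    log_p = log_gauss(z, 0.0, 1.0) + log_gauss(x, dec(z), 0.1)  # log p(z) + log p(x|z)
    log_q = log_gauss(z, enc_mean, enc_var)                     # log q(z|x)
    w = log_p - log_q                                           # log importance weights
    m = w.max()
    return m + np.log(np.exp(w - m).mean())  # stable log-mean-exp

# toy example: 2-d latent, 3-d observation, linear decoder (all hypothetical)
W = rng.standard_normal((2, 3))
x = rng.standard_normal(3)
bound = iwae_bound(x, enc_mean=np.zeros(2), enc_var=np.ones(2),
                   dec=lambda z: z @ W, K=100)
```

As $K$ grows, this bound tightens toward $\log p(x)$, which is why IWAE is a natural estimation vehicle for latent variable models.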

Paper Structure

This paper contains 43 sections, 8 theorems, 110 equations, 7 figures, 7 tables, 1 algorithm.

Key Result

Theorem 1

Suppose that $\mathcal{X}$ is a Riemannian manifold diffeomorphic to $\mathbb{R}^d$, and that $\mu_{gt}$ is a probability measure on $\mathcal{X}$ which has a nowhere vanishing density function $p_{gt}(x)$ with respect to the volume measure.

Figures (7)

  • Figure 1: Left: An example architecture of conditional no self-censoring given latent variables for 3 variables. Right: An example architecture of the traditional no self-censoring independence model for 3 variables.
  • Figure 2: Visualization of generated data from the deep generative models under the setting with latent variables and a linear transformation $g_{\psi_j}$.
  • Figure 3: An illustration of block-wise missingness with $4$ groups for $n$ subjects. Each group is a set of variables that are either missing together or observed together. Each row represents a subject, that is, a sample of $X$. White areas represent blocks of missing values, and different colors represent different missing patterns. Subjects are ordered so that those with the same missing pattern are adjacent. For brevity, we present only a subset of the missing patterns, omitting the others. (Note that the group sizes are not required to be equal, where the group size is the number of variables within a group. The block-wise missingness with $4$ groups corresponds to the case with $p=100$ and $gs=25$, the last column in Table \ref{highdimsim}, in the experiments.)
  • Figure 4: Estimation of $\mathbb{E}(X_3)$ in the Gaussian mixtures. The dotted line represents the true value of $\mathbb{E}(X_3)$.
  • Figure 5: Selection of latent dimensions for UCI datasets with MNAR missingness. The curves represent the average imputation RMSE across each validation set during 5-fold cross-validation, where the entries to be imputed are MCAR in the validation sets.
  • ...and 2 more figures
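Figure 5 selects the latent dimension by imputation RMSE on validation entries that are masked completely at random (MCAR). The metric itself is simple; here is a small self-contained sketch of computing RMSE over held-out entries, using a column-mean imputer as a hypothetical stand-in for the model (the data, mask rate, and imputer are all illustrative assumptions, not the paper's setup).

```python
import numpy as np

rng = np.random.default_rng(1)

def masked_rmse(X_true, X_imputed, mask):
    """RMSE computed only over the entries hidden by `mask` (True = held out)."""
    diff = (X_true - X_imputed)[mask]
    return np.sqrt((diff ** 2).mean())

# toy data: hide roughly 20% of entries completely at random (MCAR)
X = rng.standard_normal((100, 5))
mask = rng.random(X.shape) < 0.2
X_obs = np.where(mask, np.nan, X)

# baseline imputer: fill each hidden entry with its column's observed mean
col_means = np.nanmean(X_obs, axis=0)
X_hat = np.where(mask, col_means, X_obs)

rmse = masked_rmse(X, X_hat, mask)
```

In a cross-validation loop like the one behind Figure 5, this RMSE would be averaged over folds for each candidate latent dimension, and the dimension minimizing the average error would be selected.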

Theorems & Definitions (9)

  • Theorem 1
  • Remark
  • Theorem 2: Identifiability of missingness mechanism
  • Corollary 3
  • Theorem 4
  • Proposition 5: Bias and variance of $\hat{\mathcal{L}}_K$
  • Corollary 6: Convergence of $\hat{\mathcal{L}}_K$
  • Lemma 7
  • Theorem 8