DiffPuter: Empowering Diffusion Models for Missing Data Imputation

Hengrui Zhang, Liancheng Fang, Qitian Wu, Philip S. Yu

TL;DR

DiffPuter introduces an EM-diffusion framework for missing data imputation in tabular data, where the M-step learns a diffusion-based density $p_{\boldsymbol{\theta}}(\boldsymbol{x})$ and the E-step performs conditional sampling to impute $\boldsymbol{x}^{\rm mis}$ given $\boldsymbol{x}^{\rm obs}$. The authors prove that diffusion training corresponds to maximum likelihood estimation, while the diffusion-guided conditional imputation yields an Expected A Posteriori update for the missing values. Empirical results across ten datasets show that DiffPuter consistently outperforms 17 baselines, with average improvements of 6.94% in MAE and 4.78% in RMSE over the most competitive existing method, and that it is robust in both in-sample and out-of-sample settings. The work offers a scalable, principled approach to missing data imputation with practical impact for data cleaning and downstream modeling.
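
The two alternating steps can be written compactly as follows. This is a hedged restatement of the summary above, not the paper's exact formulation: the iteration index $k$, the row index $i$, the sample index $j$, and the number of conditional samples $N$ are notational assumptions introduced here for illustration.

```latex
% M-step: maximum likelihood fit of the diffusion-based density on the
% currently completed data (observed entries plus the latest imputations).
\boldsymbol{\theta}^{(k)}
  = \arg\max_{\boldsymbol{\theta}} \; \frac{1}{n} \sum_{i=1}^{n}
    \log p_{\boldsymbol{\theta}}\!\left(\boldsymbol{x}_i^{\rm obs},
      \boldsymbol{x}_i^{{\rm mis},(k-1)}\right)

% E-step: Expected A Posteriori update of the missing entries, approximated
% by averaging N conditional samples drawn from the learned model.
\boldsymbol{x}_i^{{\rm mis},(k)}
  = \mathbb{E}_{p_{\boldsymbol{\theta}^{(k)}}\left(\boldsymbol{x}^{\rm mis}
      \mid \boldsymbol{x}_i^{\rm obs}\right)}
    \!\left[\boldsymbol{x}^{\rm mis}\right]
  \approx \frac{1}{N} \sum_{j=1}^{N} \tilde{\boldsymbol{x}}_{i,j}^{\rm mis},
\qquad
\tilde{\boldsymbol{x}}_{i,j}^{\rm mis} \sim
  p_{\boldsymbol{\theta}^{(k)}}\!\left(\boldsymbol{x}^{\rm mis}
    \mid \boldsymbol{x}_i^{\rm obs}\right)
```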

Abstract

Generative models play an important role in missing data imputation in that they aim to learn the joint distribution of full data. However, applying advanced deep generative models (such as diffusion models) to missing data imputation is challenging due to 1) the inherent incompleteness of the training data and 2) the difficulty of performing conditional inference with unconditional generative models. To deal with these challenges, this paper introduces DiffPuter, a tailored diffusion model combined with the Expectation-Maximization (EM) algorithm for missing data imputation. DiffPuter iteratively trains a diffusion model to learn the joint distribution of missing and observed data and performs accurate conditional sampling to update the missing values using a tailored reverse sampling strategy. Our theoretical analysis shows that DiffPuter's training step corresponds to the maximum likelihood estimation of the data density (M-step), and its sampling step represents the Expected A Posteriori estimation of the missing values (E-step). Extensive experiments across ten diverse datasets and comparisons with 17 different imputation methods demonstrate DiffPuter's superior performance. Notably, DiffPuter achieves an average improvement of 6.94% in MAE and 4.78% in RMSE compared to the most competitive existing method.
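
The alternating structure described in the abstract can be illustrated with a small runnable sketch. This is not DiffPuter itself: as a loudly labeled assumption, a multivariate Gaussian stands in for the diffusion model, so that the M-step (maximum likelihood fit) and the E-step (averaging conditional samples) each fit in a few lines. The function name `em_impute` and its parameters are hypothetical.

```python
import numpy as np

def em_impute(x, mask, n_iters=5, n_samples=10, seed=0):
    """EM-style imputation with a Gaussian stand-in for the diffusion model.

    x:    (n, d) array; values where mask is False are ignored (may be NaN).
    mask: (n, d) boolean array, True where the entry is observed.
    """
    rng = np.random.default_rng(seed)
    x = np.where(mask, x, 0.0)
    # Initialize missing entries with the mean of the observed values.
    col_mean = x.sum(axis=0) / np.maximum(mask.sum(axis=0), 1)
    x_hat = np.where(mask, x, col_mean)
    d = x.shape[1]
    for _ in range(n_iters):
        # M-step (stand-in): maximum likelihood Gaussian fit on completed data.
        mu = x_hat.mean(axis=0)
        cov = np.cov(x_hat, rowvar=False, bias=True) + 1e-6 * np.eye(d)
        # E-step: draw conditional samples of the missing entries given the
        # observed ones and average them (a Monte Carlo posterior mean).
        for i in range(x.shape[0]):
            o, m = mask[i], ~mask[i]
            if not m.any():
                continue
            if o.any():
                gain = cov[np.ix_(m, o)] @ np.linalg.inv(cov[np.ix_(o, o)])
                cond_mean = mu[m] + gain @ (x[i, o] - mu[o])
                cond_cov = cov[np.ix_(m, m)] - gain @ cov[np.ix_(o, m)]
            else:
                cond_mean, cond_cov = mu[m], cov[np.ix_(m, m)]
            cond_cov = (cond_cov + cond_cov.T) / 2  # enforce symmetry
            draws = rng.multivariate_normal(cond_mean, cond_cov, size=n_samples)
            x_hat[i, m] = draws.mean(axis=0)
    return x_hat
```

In this sketch the observed entries are never modified; only the missing entries are refreshed in each E-step, and the refreshed table is what the next M-step fits. Replacing the Gaussian fit with diffusion training and the closed-form conditional with conditional reverse sampling would recover the shape of the loop the abstract describes.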

Paper Structure

This paper contains 34 sections, 2 theorems, 21 equations, 7 figures, 10 tables, 2 algorithms.

Key Result

Theorem 1

Let $\tilde{\mathbf{x}}_{T}$ be a sample from the prior distribution $\pi(\mathbf{x}) = {\mathcal{N}}(\mathbf{0}, \sigma^2(T)\mathbf{I})$, $\mathbf{x}$ be the data to impute, and the known entries of $\mathbf{x}$ are denoted by $\mathbf{x}^{\rm obs} = \hat{\mathbf{x}}_0^{\rm obs}$. The score functio

Figures (7)

  • Figure 1: An overview of the architecture of the proposed DiffPuter. DiffPuter uses one-hot encoding to transform discrete variables into continuous ones and initializes the missing entries with the mean of the observed values. The EM algorithm alternates, for $K$ iterations, between 1) fixing $\mathbf{x}^{\rm mis}$ and estimating the diffusion model parameters $\boldsymbol{\theta}$, and 2) fixing $\boldsymbol{\theta}$ and estimating $\mathbf{x}^{\rm mis}$. The final imputation result $\mathbf{x}^{*}$ is returned from the E-step of the last iteration.
  • Figure 2: MCAR, in-sample imputation performance on continuous columns: comparison of DiffPuter with 17 baselines on imputing continuous data across all nine datasets. A blank column indicates that the method fails or runs out of memory on that dataset. DiffPuter outperforms the most competitive baseline by $6.94\%$ (MAE) and $4.78\%$ (RMSE) on average. The circled number after each model name denotes its rank among all methods.
  • Figure 3: Impact of the number of EM iterations. DiffPuter's performance steadily improves as the number of EM iterations increases.
  • Figure 3: Effects of combining EM with other Deep Generative Models.
  • Figure 4: Impacts of the number of sampled imputations per iteration. A very small $N$ leads to poor performance and large variance.
  • ...and 2 more figures
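
The preprocessing mentioned in the Figure 1 caption (one-hot encoding of discrete variables, then mean initialization of missing entries) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: the helper names `one_hot_with_mask` and `mean_initialize`, the integer category codes, and the convention that a missing category leaves all of its one-hot entries unobserved are all hypothetical choices made here.

```python
import numpy as np

def one_hot_with_mask(codes, n_categories, observed):
    """One-hot encode an integer-coded discrete column.

    codes:    (n,) integer category ids (ignored where observed is False).
    observed: (n,) boolean array, True where the category is known.
    Returns the (n, n_categories) encoding and a same-shaped observation mask.
    """
    n = codes.shape[0]
    onehot = np.zeros((n, n_categories))
    rows = np.flatnonzero(observed)
    onehot[rows, codes[rows]] = 1.0
    # Assumption: a missing category marks every one-hot entry as missing.
    mask = np.repeat(observed[:, None], n_categories, axis=1)
    return onehot, mask

def mean_initialize(x, mask):
    """Fill unobserved entries with the mean of the observed values per column."""
    x = np.where(mask, x, 0.0)
    col_mean = x.sum(axis=0) / np.maximum(mask.sum(axis=0), 1)
    return np.where(mask, x, col_mean)

# Usage: the last row has an unknown category (code -1 is a placeholder).
codes = np.array([0, 2, 1, -1])
observed = np.array([True, True, True, False])
onehot, mask = one_hot_with_mask(codes, 3, observed)
x_init = mean_initialize(onehot, mask)
```

With this convention, mean initialization of a one-hot block fills a missing row with the observed category frequencies, which gives the first M-step a complete, fully continuous table to train on.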

Theorems & Definitions (6)

  • Remark 1
  • Remark 2
  • Theorem 1
  • proof
  • Lemma 1
  • proof