Self-Supervision Improves Diffusion Models for Tabular Data Imputation

Yixin Liu, Thalaiyasingam Ajanthan, Hisham Husain, Vu Nguyen

TL;DR

This work tackles missing data in tabular datasets by adapting diffusion models for imputation. It identifies objective and data-scale mismatches that hinder vanilla diffusion approaches and introduces SimpDM, which combines self-supervised alignment to reduce sensitivity to initial noise and state-dependent augmentation to bolster robustness under limited data. The method uses a hybrid input design, pseudo-missing training, and a two-channel self-supervised objective, with extensions to mixed-type data via multinomial diffusion. Comprehensive experiments across 17 real-world datasets show that SimpDM matches or surpasses state-of-the-art imputation methods, with strong stability across MAR/MNAR scenarios and varying missing ratios, while remaining memory-efficient. The proposed approach advances practical imputation by improving accuracy, robustness, and scalability of diffusion-based tabular imputation.

Abstract

The ubiquity of missing data has drawn considerable attention to tabular data imputation methods. Diffusion models, recognized as a cutting-edge technique for data generation, show significant potential for tabular data imputation. However, in pursuit of diversity, vanilla diffusion models are often sensitive to the initial noise, which hinders them from generating stable and accurate imputation results. Additionally, the sparsity inherent in tabular data makes it difficult for diffusion models to model the data manifold accurately, which reduces their robustness for data imputation. To tackle these challenges, this paper introduces the Self-supervised imputation Diffusion Model (SimpDM for brevity), a diffusion model specifically tailored for tabular data imputation. To mitigate sensitivity to noise, we introduce a self-supervised alignment mechanism that regularizes the model toward consistent and stable imputation predictions. Furthermore, we introduce a state-dependent data augmentation strategy within SimpDM that enhances the robustness of the diffusion model when data is limited. Extensive experiments demonstrate that SimpDM matches or outperforms state-of-the-art imputation methods across various scenarios.

Paper Structure

This paper contains 35 sections, 12 equations, 5 figures, 5 tables, 2 algorithms.

Figures (5)

  • Figure 1: Motivating experiments on the UCI Power dataset. (a) Imputation results for a single data sample produced by a diffusion model under different Gaussian initializations at the first diffusion step. (b) Imputation performance under different numbers of training samples.
  • Figure 2: Overall training pipeline of SimpDM. Given a training sample $\mathbf{x}$ and its missing mask $\mathbf{m}$, the first step is to apply average padding to the missing entries and sample the pseudo mask $\mathbf{m}_p$ and the condition mask $\mathbf{m}_c$. For self-supervised alignment, different $t$ and $\epsilon$ are sampled for each of two channels, and the diffusion model is run on each channel. In addition to the diffusion model loss $\mathcal{L}_{dm}$, a self-supervised alignment loss $\mathcal{L}_{sa}$ minimizes the distance between the predictions of the two channels. A state-dependent augmentation strategy further perturbs the model's input according to the state (GT, MS, or PM) of each entry.
  • Figure 3: Imputation performance on MAR and MNAR scenarios.
  • Figure 4: Imputation performance under different missing ratios.
  • Figure 5: (a) Runtime per training epoch on different datasets (with size $n \times d$). (b) Imputation results from different initializations. (c) t-SNE visualization of ground-truth data and imputed data.
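
The training steps described in the Figure 2 caption can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the mask convention (1 = observed), the way observed entries are split into $\mathbf{m}_p$ and $\mathbf{m}_c$, the hybrid input that keeps condition entries clean, the mean-squared-error form of both losses, and the names `mean_pad`, `sample_masks`, `noisy_input`, `toy_denoiser`, `two_channel_losses`, and `pseudo_ratio` are all hypothetical. The linear `toy_denoiser` merely stands in for the real network, and state-dependent augmentation is left out.

```python
import numpy as np

def mean_pad(x, m):
    """Fill missing entries (m == 0) with the column mean of the observed entries."""
    observed = m.astype(bool)
    counts = np.maximum(observed.sum(axis=0), 1)  # avoid division by zero
    col_mean = np.where(observed, x, 0.0).sum(axis=0) / counts
    return np.where(observed, x, col_mean)

def sample_masks(m, pseudo_ratio, rng):
    """Assumption: observed entries are split into pseudo-missing and condition sets."""
    observed = m.astype(bool)
    hide = rng.random(m.shape) < pseudo_ratio
    return observed & hide, observed & ~hide  # (m_p, m_c)

def noisy_input(x, m_c, t, eps, alpha_bar):
    """Standard DDPM forward noising, with condition entries kept clean (assumption)."""
    a = alpha_bar[t][:, None]  # per-sample cumulative alpha, shape (n, 1)
    x_t = np.sqrt(a) * x + np.sqrt(1.0 - a) * eps
    return np.where(m_c, x, x_t)

def toy_denoiser(inp, W):
    """Hypothetical stand-in for the diffusion network: a single linear map."""
    return inp @ W

def two_channel_losses(x, m, W, alpha_bar, rng, pseudo_ratio=0.3):
    x_pad = mean_pad(x, m)
    m_p, m_c = sample_masks(m, pseudo_ratio, rng)
    preds = []
    for _ in range(2):  # two channels, each with its own t and epsilon
        t = rng.integers(0, len(alpha_bar), size=x.shape[0])
        eps = rng.standard_normal(x.shape)
        preds.append(toy_denoiser(noisy_input(x_pad, m_c, t, eps, alpha_bar), W))
    n_p = max(m_p.sum(), 1)
    # L_dm: reconstruction error on pseudo-missing entries, averaged over channels
    l_dm = sum((((p - x_pad) ** 2) * m_p).sum() for p in preds) / (2 * n_p)
    # L_sa: distance between the two channels' predictions on the same entries
    l_sa = (((preds[0] - preds[1]) ** 2) * m_p).sum() / n_p
    return l_dm, l_sa

rng = np.random.default_rng(0)
x_full = rng.standard_normal((8, 4))
m = (rng.random((8, 4)) > 0.2).astype(int)      # 1 = observed, 0 = missing
x_obs = np.where(m == 1, x_full, np.nan)        # missing entries are unknown
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 50))
l_dm, l_sa = two_channel_losses(x_obs, m, np.eye(4), alpha_bar, rng)
print(l_dm, l_sa)
```

In a real training loop, the two terms would presumably be combined into a single objective, for example `l_dm + gamma * l_sa` with a hypothetical weight `gamma`, and minimized with respect to the network parameters.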