Rethinking the Diffusion Models for Numerical Tabular Data Imputation from the Perspective of Wasserstein Gradient Flow

Zhichao Chen; Haoxuan Li; Fangyikang Wang; Odin Zhang; Hu Xu; Xiaoyu Jiang; Zhihuan Song; Eric H. Wang

TL;DR

This paper introduces Kernelized Negative Entropy-regularized Wasserstein gradient flow Imputation (KnewImp), a principled approach to imputing missing values in numerical tabular data. The authors prove that KnewImp's imputation procedure can be derived from a second cost functional defined on the joint distribution, which eliminates the need for the mask matrix and thereby addresses issue (2), the difficulty of training.

Abstract

Diffusion models (DMs) have gained attention in Missing Data Imputation (MDI), but two long-neglected issues remain: (1) Inaccurate Imputation, which arises from the inherently sample-diversification-pursuing generative process of DMs; and (2) Difficult Training, which stems from the intricate design required for the mask matrix in the model training stage. To address these concerns within the realm of numerical tabular datasets, we introduce a novel principled approach termed Kernelized Negative Entropy-regularized Wasserstein gradient flow Imputation (KnewImp). Specifically, based on the Wasserstein gradient flow (WGF) framework, we first prove that issue (1) arises because the cost functionals implicitly maximized in DM-based MDI are equivalent to the MDI objective plus diversification-promoting non-negative terms. Based on this, we then design a novel cost functional with a diversification-discouraging negative entropy term and derive our KnewImp approach within the WGF framework and a reproducing kernel Hilbert space. After that, we prove that the imputation procedure of KnewImp can be derived from another cost functional related to the joint distribution, eliminating the need for the mask matrix and hence naturally addressing issue (2). Extensive experiments demonstrate that our proposed KnewImp approach significantly outperforms existing state-of-the-art methods.
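To make the idea of a kernelized, diversification-discouraging particle update more concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes an SVGD-style velocity field with an RBF kernel in which the kernel-gradient term is given the opposite sign and a weight `lam`, so that particles attract rather than repel; the exact velocity field in the paper may differ. The function names (`rbf_kernel`, `velocity`, `impute`), the boolean `observed` array, and the user-supplied `score_fn` are all hypothetical stand-ins. In particular, the abstract does not specify how the score is obtained here, so a hand-written toy score would be used in place of a learned one.

```python
import numpy as np

def rbf_kernel(X, h):
    """Return the RBF kernel matrix K[j, i] and differences diff[j, i] = x_j - x_i."""
    diff = X[:, None, :] - X[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * h ** 2)), diff

def velocity(X, score_fn, h=1.0, lam=0.1):
    """SVGD-style velocity with an attractive (sign-flipped) kernel-gradient term.

    Assumption: this sign flip is our reading of 'diversification-discouraging';
    it is an illustration, not the formula from the paper.
    """
    n = X.shape[0]
    K, diff = rbf_kernel(X, h)
    S = score_fn(X)                      # (n, d) score evaluated at each particle
    drive = K.T @ S / n                  # (1/n) * sum_j k(x_j, x_i) * score(x_j)
    # Gradient of k(x_j, x_i) with respect to x_j for the RBF kernel.
    grad_K = -(diff / h ** 2) * K[:, :, None]
    attract = -lam * grad_K.mean(axis=0) # pulls x_i toward nearby particles
    return drive + attract

def impute(X_init, observed, score_fn, steps=200, eta=0.1, h=1.0, lam=0.1):
    """Explicit Euler steps on the velocity field; observed entries stay fixed."""
    X = np.array(X_init, dtype=float)
    for _ in range(steps):
        v = velocity(X, score_fn, h=h, lam=lam)
        X = np.where(observed, X, X + eta * v)
    return X
```

Here `eta` plays the role of a discretization step and `h` the kernel bandwidth. Calling `impute(X0, observed, lambda X: -X)` would use the score of a standard Gaussian as a toy target; only entries where `observed` is `False` are updated.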

Paper Structure (38 sections, 11 theorems, 53 equations, 13 figures, 7 tables, 4 algorithms)

Introduction
Preliminaries
Missing Data Imputation
Diffusion Models and its application for Missing Data Imputation
Wasserstein Gradient Flow
Proposed Approach
Unifying DM-based MDI within WGF framework
Negative Entropy Regularized & Closed-form Velocity Field Expression
Modeling ${p}( \boldsymbol{X}^{(\text{miss})}\vert \boldsymbol{X}^{(\text{obs})})$ by ${p}( \boldsymbol{X}^{(\text{miss})}, \boldsymbol{X}^{(\text{obs})})$
Overall Architecture of KnewImp
Experiments
Experimental Setup
Baseline Comparison Results
Ablation Study Results
Sensitivity Analysis
...and 23 more sections

Key Result

Proposition 3.1

Within the WGF framework, DM-based MDI approaches can be viewed as finding the imputed values $\boldsymbol{X}^{(\text{imp})}$ that maximize the following objective: where const denotes a constant, and $\psi(\boldsymbol{X}^{(\text{miss})})$ is a scalar function determined by the type of SDE underlying the DMs. It is important to note that in DMs, the condition $\psi(\boldsymbol{X}^{(\text{miss})})$ …

Figures (13)

Figure 1: Illustration of KnewImp. The left part shows the missing values being imputed by WGF, and the right part shows DSM being used to estimate $\log{p(\boldsymbol{X}^{\text{(miss)}})}$.
Figure 2: Parameter sensitivity of KnewImp to the kernel bandwidth ($h$), the number of hidden units in the score network ($\text{HU}_{\text{score}}$), the NER weight ($\lambda$), and the discretization step ($\eta$) for the joint velocity field equation, on the CC dataset. Scatter points show mean values and shaded areas show one standard deviation from the mean.
Figure E.1: Parameter sensitivity of KnewImp to the kernel bandwidth ($h$), the number of hidden units in the score network ($\text{HU}_{\text{score}}$), the NER weight ($\lambda$), and the discretization step ($\eta$) for the joint velocity field equation, on the CC dataset. Scatter points show mean values and shaded areas show one standard deviation from the mean.
Figure E.2: Average computation time. The scatters and shaded areas indicate the mean and one standard deviation from the mean, respectively.
Figure E.3: Evolution of the evaluation metrics over iteration time $\tau$ under the MAR scenario at a 30% missing rate. The shaded area indicates the $\pm 1.0$ standard deviation uncertainty interval.
...and 8 more figures

Theorems & Definitions (18)

Proposition 3.1
Proposition 3.2
Proposition 3.3
Proposition 3.4
Corollary 3.5
Proposition (prop:inEffectiveSampling)
Proof
Proposition (prop:simulationLoss)
Proof
Proposition (prop:steinMap)
...and 8 more
