
IVGAE: Handling Incomplete Heterogeneous Data with a Variational Graph Autoencoder

Youran Zhou, Mohamed Reda Bouadjenek, Sunil Aryal

TL;DR

IVGAE targets imputation for incomplete heterogeneous tabular data under unknown missingness mechanisms. It constructs a bipartite graph of samples and features and employs a dual-decoder variational graph autoencoder, paired with a Transformer-based heterogeneous embedding to jointly reconstruct values and missingness patterns. The method demonstrates robust improvements in reconstruction (AvgErr) and downstream F1 across MCAR, MAR, and MNAR on 16 real-world datasets, with favorable scalability. This mechanism-aware, heterogeneous-capable approach advances imputation robustness for practical, real-world data with mixed feature types.

Abstract

Handling missing data remains a fundamental challenge in real-world tabular datasets, especially when data are heterogeneous with both numerical and categorical features. Existing imputation methods often fail to capture complex structural dependencies and handle heterogeneous data effectively. We present IVGAE, a Variational Graph Autoencoder framework for robust imputation of incomplete heterogeneous data. IVGAE constructs a bipartite graph to represent sample-feature relationships and applies graph representation learning to model structural dependencies. A key innovation is its dual-decoder architecture, where one decoder reconstructs feature embeddings and the other models missingness patterns, providing structural priors aware of missing mechanisms. To better encode categorical variables, we introduce a Transformer-based heterogeneous embedding module that avoids high-dimensional one-hot encoding. Extensive experiments on 16 real-world datasets show that IVGAE achieves consistent improvements in RMSE and downstream F1 across MCAR, MAR, and MNAR missing scenarios under 30% missing rates. Code and data are available at: https://github.com/echoid/IVGAE.
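
To make the bipartite-graph idea concrete, the following is a minimal sketch of how an incomplete table could be turned into a sample-feature graph. It is not taken from the IVGAE repository: the function name `build_bipartite_graph`, the use of `np.nan` to mark missing cells, and the choice to store observed values as edge attributes are all assumptions made for illustration. The observation mask doubles as the bipartite adjacency pattern, which is the kind of structure a missingness decoder would try to reconstruct.

```python
import numpy as np

def build_bipartite_graph(X):
    """Hypothetical helper: convert a samples-by-features table into a bipartite graph.

    X is a 2-D float array in which np.nan marks a missing entry (an assumption
    of this sketch). Sample nodes are numbered 0..n_samples-1 and feature nodes
    are numbered n_samples..n_samples+n_features-1.
    """
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape

    # True where a value is observed; this is also the bipartite adjacency pattern.
    mask = ~np.isnan(X)

    # One edge per observed cell, linking a sample node to a feature node.
    rows, cols = np.nonzero(mask)
    edge_index = np.stack([rows, cols + n_samples])

    # Observed values ride along as edge attributes.
    edge_attr = X[rows, cols]
    return edge_index, edge_attr, mask

# Toy table: 2 samples, 3 features, 2 missing cells.
X = np.array([[1.0, np.nan, 3.0],
              [np.nan, 5.0, 6.0]])
edge_index, edge_attr, mask = build_bipartite_graph(X)
```

In this toy table, four of the six cells are observed, so the graph has four edges. Missing cells simply have no edge, and imputing them amounts to predicting values for absent sample-feature pairs.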

Paper Structure

This paper contains 38 sections, 17 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: Encoding strategies for heterogeneous data within a bipartite graph. Top: One-hot encoding expands each categorical feature ($F_2$) into multiple binary nodes, increasing graph size and sparsity. Bottom: The proposed heterogeneous embedding learns compact semantic representations that preserve feature relationships while reducing dimensionality.
  • Figure 2: Overview of the proposed IVGAE framework. The model encodes a bipartite graph representation of the dataset, learns latent node embeddings via variational inference, and reconstructs both the feature values $\hat{\mathbf{X}}$ and the adjacency matrix $\hat{\mathbf{A}}$ through a dual-decoder mechanism. This design enables simultaneous modeling of feature reconstruction and missingness patterns for mechanism-aware imputation.
  • Figure 3: Comparison of AvgErr across different missingness mechanisms (MCAR, MAR, MNAR). Lower values indicate improved reconstruction accuracy.
  • Figure 4: Critical Difference (CD) diagram of average ranks based on AvgErr across all datasets, missingness mechanisms (MCAR, MAR, MNAR), and missing rates. Lower ranks indicate better imputation performance; methods connected by a horizontal line are not significantly different at $\alpha = 0.05$.
  • Figure 5: Runtime scalability of IVGAE and representative baselines across varying sample sizes, feature dimensions, and missing rates. The Y-axis is in log scale. Lower values indicate faster execution.
  • ...and 1 more figure
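
The average ranks summarized in a Critical Difference diagram such as Figure 4 can be computed as sketched below. The error values and the method labels `A`, `B`, and `C` are invented purely for illustration and are not results from the paper. The sketch also assumes no ties within a row; with ties, an average-rank routine such as `scipy.stats.rankdata` would be the usual choice.

```python
import numpy as np

# Hypothetical error table: rows are evaluation settings (dataset, mechanism,
# missing rate), columns are imputation methods. Values are made up.
methods = ["A", "B", "C"]
errors = np.array([
    [0.30, 0.20, 0.25],
    [0.40, 0.35, 0.30],
    [0.15, 0.10, 0.12],
])

# Rank methods within each row: rank 1 is the lowest error.
# The double argsort yields 0-based ranks when a row has no ties.
ranks = errors.argsort(axis=1).argsort(axis=1) + 1

# Average rank per method across all rows; lower is better.
avg_ranks = ranks.mean(axis=0)

for name, rank in zip(methods, avg_ranks):
    print(f"{name}: average rank {rank:.2f}")
```

A CD diagram then places each method at its average rank and joins methods whose rank difference falls below the critical difference of the chosen post-hoc test.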