Data Imputation with Iterative Graph Reconstruction
Jiajun Zhong, Weiwei Ye, Ning Gui
TL;DR
IGRM tackles missing data in tabular datasets by introducing a learnable friend network that differentiates sample relevance during imputation. The method jointly optimizes the friend network and a bipartite graph imputation model end to end, using differentiable structure augmentation and sample-embedding guidance to improve information flow. On eight real-world datasets, it lowers mean absolute error (MAE) by 39.13% relative to nine baselines and by 9.04% relative to the second-best method at a 30% missing ratio, and it remains robust across varying missing ratios. The work highlights the practical value of encoding sample-sample relations through a learned friend network in graph-based imputation pipelines.
Abstract
Effective data imputation demands rich latent "structure" discovery capabilities from "plain" tabular data. Recent graph neural network-based data imputation solutions show strong structure-learning potential by directly translating tabular data into bipartite graphs. However, because they lack relations between samples, these solutions treat all samples equally, which contradicts one important observation: "similar samples should give more information about missing values." This paper presents a novel Iterative graph Generation and Reconstruction framework for Missing data imputation (IGRM). Instead of treating all samples equally, we introduce the concept of "friend networks" to represent different relations among samples. To generate an accurate friend network from data with missing values, an end-to-end friend network reconstruction solution is designed to allow continuous friend network optimization during imputation learning. The representation of the optimized friend network is, in turn, used to further optimize the data imputation process with differentiated message passing. Experimental results on eight benchmark datasets show that IGRM yields 39.13% lower mean absolute error than nine baselines and 9.04% lower than the second-best method. Our code is available at https://github.com/G-AILab/IGRM.
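To make the two graph structures mentioned above concrete, the following is a minimal sketch, not the IGRM implementation. It assumes missing cells are marked with `None`, builds sample-feature bipartite edges from observed cells only, and forms a friend network with a fixed k-nearest-neighbour rule on cosine similarity. That rule is a simple, non-differentiable stand-in for the learned, end-to-end friend network reconstruction described in the abstract. The function names (`build_bipartite_edges`, `build_friend_edges`) and the zero-filled placeholder embeddings are hypothetical and chosen only for illustration.

```python
import math


def build_bipartite_edges(table):
    """Return (sample_idx, feature_idx, value) edges for observed cells only."""
    edges = []
    for i, row in enumerate(table):
        for j, value in enumerate(row):
            if value is not None:  # assumption: None marks a missing cell
                edges.append((i, j, value))
    return edges


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def build_friend_edges(embeddings, k=1):
    """Link each sample to its k most similar samples (undirected, deduplicated).

    Hard k-NN stand-in: unlike the learned friend network in the paper,
    this selection is not differentiable and is not updated during training.
    """
    friends = set()
    for i, e_i in enumerate(embeddings):
        scored = sorted(
            ((cosine(e_i, e_j), j) for j, e_j in enumerate(embeddings) if j != i),
            reverse=True,
        )
        for _, j in scored[:k]:
            friends.add((min(i, j), max(i, j)))
    return sorted(friends)


# Toy table: 3 samples, 3 features, None = missing.
table = [
    [1.0, None, 3.0],
    [1.1, 2.0, None],
    [None, 9.0, 0.5],
]
bipartite_edges = build_bipartite_edges(table)

# Placeholder sample embeddings: zero-fill the missing cells (illustrative only;
# in a graph-based imputer these would come from the model itself).
embeddings = [[0.0 if v is None else v for v in row] for row in table]
friend_edges = build_friend_edges(embeddings, k=1)

print(len(bipartite_edges), friend_edges)
```

In this sketch the friend edges only indicate which samples would exchange messages more directly; how those edges are weighted and refined during imputation learning is left to the method itself.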
