Enhancing Missing Data Imputation through Combined Bipartite Graph and Complete Directed Graph

Zhaoyang Zhang, Hongtu Zhu, Ziqi Chen, Yingjie Zhang, Hai Shu

TL;DR

This paper introduces the Bipartite and Complete Directed Graph Neural Network (BCGNN), a framework for missing data imputation on tabular data, and shows experimentally that learning the interdependence structure among features substantially improves the model's feature embeddings.

Abstract

This paper addresses a central challenge in missing data imputation for tabular data: identifying and leveraging the interdependencies among features. We introduce a novel framework named the Bipartite and Complete Directed Graph Neural Network (BCGNN). In BCGNN, observations and features are treated as two distinct node types, and each observed feature value becomes an attributed edge linking the corresponding observation node and feature node. The bipartite graph component inductively learns node embeddings from the information carried by these attributed edges, while the complete directed graph component captures and propagates the interdependencies among features. Compared with contemporary leading imputation methods, BCGNN consistently outperforms them, reducing mean absolute error in feature imputation by 15% on average across different missingness mechanisms. Our experiments show that learning the interdependence structure substantially improves the model's feature embeddings. We also demonstrate the model's superior performance in label prediction tasks with missing data and its ability to generalize to unseen data points.
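
The bipartite construction described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `build_bipartite_edges`, the use of NaN to mark missing entries, and the edge-list layout are all assumptions made for illustration. Each observed entry of the data matrix becomes one edge between an observation node and a feature node, with the observed value as the edge attribute.

```python
import numpy as np

def build_bipartite_edges(X):
    """Convert a data matrix with NaNs into a bipartite edge list.

    Hypothetical helper: returns (obs_idx, feat_idx, edge_attr), with one
    edge per observed entry X[i, j] linking observation i to feature j.
    """
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)                    # mask of observed entries
    obs_idx, feat_idx = np.nonzero(observed)   # row-major order
    edge_attr = X[obs_idx, feat_idx]           # observed values as edge attributes
    return obs_idx, feat_idx, edge_attr

# Toy data matrix: 2 observations, 3 features, NaN marks a missing value.
X = np.array([[1.0, np.nan, 3.0],
              [np.nan, 5.0, 6.0]])
obs_idx, feat_idx, edge_attr = build_bipartite_edges(X)
```

Under this representation, imputing a missing entry amounts to predicting the attribute of an edge that is absent from the list, which matches the edge-level prediction framing used in the paper.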

Paper Structure

This paper contains 32 sections, 19 equations, 10 figures, 3 tables, 1 algorithm.

Figures (10)

  • Figure 1: Comparison of GRAPE and IGRM on the Energy dataset in terms of test MAE in feature imputation (Left); Visualizations of the original and the imputed (by IGRM) Energy datasets using t-SNE for dimensionality reduction (Middle and Right). The poor performance of IGRM can be attributed to erroneously identifying similarities among samples from different classes, leading to indistinguishable sample representations across classes.
  • Figure 2: Flowchart of our BCGNN method. BCGNN consists of a bipartite graph and a complete directed graph constructed from the data matrix and correlation coefficient matrix of features. The working mechanisms of the bipartite graph (Red Dot-Dashed Box), the complete directed graph (Blue Dot-Dashed Box) and the union of two subgraphs are elaborated in detail in the corresponding sections of the paper. With the constructed graph, the feature imputation problem and the label prediction problem are treated as edge-level and node-level prediction tasks, respectively (Black Dot-Dashed Box).
  • Figure 3: Average test MAE of feature imputation at a missing rate of 0.3 under MCAR (Left) and MNAR (Right) in UCI datasets over 5 random trials. The results are normalized by the average performance of the Mean imputation.
  • Figure 4: Left: Average test MAE of feature imputation under MCAR, MAR and MNAR with different missing rates in the Concrete and Energy datasets over 5 random trials. Right: The embedding spaces $V_F$ and $V_O$ for feature and observation nodes, respectively. They are obtained from the trained BCGNN with/without learning interdependence structure in Concrete and Energy under MAR with a missing rate of 0.3. The colored dots represent feature node embeddings and the grey dots represent observation node embeddings.
  • Figure 5: Average test MAE of feature imputation at a missing rate of 0.3 under MAR in UCI datasets over 5 random trials. The results are normalized by the average performance of the Mean imputation.
  • ...and 5 more figures
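
The Figure 2 caption states that the complete directed graph is built from the correlation coefficient matrix of features. A minimal sketch of one way to do this follows. It is an assumption-laden illustration, not the paper's procedure: the function name `build_complete_digraph` is hypothetical, Pearson correlation over pairwise-complete rows is an assumed choice (the excerpt does not say how missing values are handled when computing correlations), and pairs with too few shared rows or zero variance are given weight 0 by convention.

```python
import numpy as np

def build_complete_digraph(X):
    """Return (src, dst, weight) for every ordered feature pair (i, j), i != j.

    Hypothetical helper. Weights are Pearson correlations computed on rows
    where both features are observed (an assumption, not from the paper).
    """
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    corr = np.eye(p)  # off-diagonal entries default to 0
    for i in range(p):
        for j in range(i + 1, p):
            both = ~np.isnan(X[:, i]) & ~np.isnan(X[:, j])
            if both.sum() < 2:
                continue  # not enough shared rows to estimate a correlation
            xi, xj = X[both, i], X[both, j]
            if xi.std() == 0 or xj.std() == 0:
                continue  # correlation undefined for a constant column
            corr[i, j] = corr[j, i] = np.corrcoef(xi, xj)[0, 1]
    # Complete directed graph: all ordered pairs except self-loops.
    src, dst = np.nonzero(~np.eye(p, dtype=bool))
    return src, dst, corr[src, dst]

# Toy data: feature 1 rises with feature 0, feature 2 falls with it.
X = np.array([[1.0, 2.0, np.nan],
              [2.0, 4.0, 3.0],
              [3.0, 6.0, 2.0],
              [4.0, np.nan, 1.0]])
src, dst, weight = build_complete_digraph(X)
```

Because a correlation matrix is symmetric, both directions of a feature pair start with the same attribute in this sketch; how BCGNN differentiates the two directions during message passing is not specified in this excerpt.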