Interpretable Graph Neural Networks for Heterogeneous Tabular Data

Amr Alkhatib, Henrik Boström

TL;DR

IGNH introduces an interpretable graph neural network tailored for heterogeneous tabular data, delivering exact feature attributions alongside predictions. Each data point is represented as a feature graph whose edges are derived from statistically significant correlations. IGNH handles numerical and categorical features jointly through dedicated embeddings, uses message passing dominated by self-loops, and applies an injective readout that enables per-feature contributions. Empirical results across 30 datasets show that the model's feature attributions align with post-hoc Shapley values approximated by KernelSHAP, and that predictive performance is competitive with XGBoost, Random Forests, and TabNet, particularly on datasets rich in categorical features. The work advances trustworthy tabular modeling by providing transparent, high-performing models, and it suggests future extensions to non-tabular data and user-focused explanations.

Abstract

Many machine learning algorithms for tabular data produce black-box models, which prevent users from understanding the rationale behind the model predictions. In their unconstrained form, graph neural networks fall into this category, and they have further limited abilities to handle heterogeneous data. To overcome these limitations, an approach is proposed, called IGNH (Interpretable Graph Neural Network for Heterogeneous tabular data), which handles both categorical and numerical features, while constraining the learning process to generate exact feature attributions together with the predictions. A large-scale empirical investigation is presented, showing that the feature attributions provided by IGNH align with Shapley values that are computed post hoc. Furthermore, the results show that IGNH outperforms two powerful machine learning algorithms for tabular data, Random Forests and TabNet, while reaching a similar level of performance as XGBoost.
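
The idea of producing a prediction together with an exact additive attribution can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the fixed correlation threshold (standing in for a significance test), the random untrained weights, the tanh message-passing layers, and the linear per-node projection (standing in for the injective mapping) are all assumptions made for illustration, and categorical embeddings are omitted.

```python
import numpy as np

def correlation_adjacency(X, threshold=0.3):
    """Feature graph: one node per column, edges weighted by Pearson correlation.

    Assumption: a fixed |r| threshold stands in for a statistical significance test.
    """
    corr = np.corrcoef(X, rowvar=False)
    adj = np.where(np.abs(corr) >= threshold, corr, 0.0)
    np.fill_diagonal(adj, 1.0)  # self-loops keep each node tied to its own feature
    return adj

def init_params(n_features, dim, n_layers, rng):
    """Random, untrained weights (hypothetical parameter layout)."""
    return {
        "embed": rng.normal(size=(n_features, dim)),   # one embedding vector per feature
        "layers": [rng.normal(scale=0.5, size=(dim, dim)) for _ in range(n_layers)],
        "proj": rng.normal(size=dim),                  # node representation -> scalar
        "out_w": rng.normal(size=n_features),          # one output weight per feature
        "out_b": 0.1,
    }

def forward(x, adj, params):
    """Return (probability, per-feature contributions) for one data instance x."""
    h = x[:, None] * params["embed"]        # initial node states, shape (n_features, dim)
    for W in params["layers"]:
        h = np.tanh(adj @ h @ W)            # one round of message passing
    s = h @ params["proj"]                  # one scalar per node
    contributions = params["out_w"] * s     # additive per-feature terms
    logit = contributions.sum() + params["out_b"]
    prob = 1.0 / (1.0 + np.exp(-logit))
    return prob, contributions

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)  # make columns 0 and 1 correlated
adj = correlation_adjacency(X)
params = init_params(n_features=4, dim=8, n_layers=2, rng=rng)
prob, contributions = forward(X[0], adj, params)
```

In this sketch the contributions plus the bias sum to the logit by construction, which is the sense in which the attribution is exact here; how closely each contribution reflects only its own feature depends on how strongly the self-loops dominate the neighbor messages.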

Paper Structure

This paper contains 17 sections, 5 equations, 6 figures, 3 tables, and 1 algorithm.

Figures (6)

  • Figure 1: An overview of the proposed approach. Each data example is represented as a graph. The features of the data instance are the nodes, and the edges between nodes represent the correlations between features. Multiple iterations of message passing are applied. Finally, the obtained node representations are projected into scalar values using an injective mapping function, and the graph representation is obtained by concatenating the projected values, which are then used for prediction.
  • Figure 2: Comparison between the approximations generated by KernelSHAP and the importance scores obtained from IGNH. We assess the similarity of KernelSHAP's approximations to the scores produced by IGNH during the iterations of data sampling and evaluation by KernelSHAP. It becomes evident that KernelSHAP shows improved accuracy in approximating the scores derived from IGNH with further data sampling.
  • Figure 3: Explanation of a single prediction on the Numerai 28.6 dataset.
  • Figure 4: The average rank of the compared algorithms over the 30 datasets with respect to the AUC, where a lower rank is better. The critical difference (CD) shows the biggest difference that is not statistically significant.
  • Figure 5: Comparison between the approximations generated by KernelSHAP and the importance scores obtained from IGNH. We assess the similarity of KernelSHAP's approximations to the scores produced by IGNH during the iterations of data sampling and evaluation by KernelSHAP. It becomes evident that KernelSHAP shows improved accuracy in approximating the scores derived from IGNH with further data sampling.
  • ...and 1 more figure
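
The average ranks and critical difference mentioned in the Figure 4 caption can be sketched as follows. The caption does not name the post-hoc test, so this example assumes the Nemenyi test at a significance level of 0.05, with the commonly tabulated critical values; the function names and the toy AUC matrix are hypothetical and are not results from the paper.

```python
import numpy as np

# Critical values q_alpha for the Nemenyi test at alpha = 0.05,
# indexed by the number of compared algorithms k (assumed test; see lead-in).
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
               7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}

def average_ranks(scores):
    """Mean rank per algorithm; scores has shape (n_datasets, n_algorithms).

    Higher score is better, rank 1 is best, ties share the mean of their ranks.
    """
    scores = np.asarray(scores, dtype=float)
    ranks = np.empty_like(scores)
    for d, row in enumerate(scores):
        better = (row[None, :] > row[:, None]).sum(axis=1)   # entries strictly better
        equal = (row[None, :] == row[:, None]).sum(axis=1)   # ties, including itself
        ranks[d] = better + (equal + 1) / 2.0
    return ranks.mean(axis=0)

def critical_difference(n_algorithms, n_datasets):
    """Nemenyi CD: rank gaps no larger than this are not statistically significant."""
    q = Q_ALPHA_005[n_algorithms]
    return q * np.sqrt(n_algorithms * (n_algorithms + 1) / (6.0 * n_datasets))

# Toy AUC values for three hypothetical algorithms on three hypothetical datasets.
toy_auc = [[0.90, 0.85, 0.80],
           [0.70, 0.75, 0.60],
           [0.88, 0.88, 0.70]]
toy_ranks = average_ranks(toy_auc)

# CD for a comparison of 4 algorithms over 30 datasets (k = 4 is an assumption).
cd = critical_difference(n_algorithms=4, n_datasets=30)
```

Under these assumptions, two algorithms whose average ranks differ by no more than `cd` would be connected in a critical difference diagram, meaning the comparison does not separate them.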