A Fair Comparison of Graph Neural Networks for Graph Classification

Federico Errica, Marco Podda, Davide Bacciu, Alessio Micheli

TL;DR

The paper tackles reproducibility and fair benchmarking in graph classification by re-evaluating five prominent GNN architectures across nine datasets within a strict model selection and assessment framework. It introduces structure-agnostic baselines to disentangle the contribution of graph topology from node features and analyzes the impact of including node degree as a feature. The authors conduct an extensive, cross-dataset study (over 47k training runs) and reveal that, on some chemical datasets, topological information has not been effectively exploited by current GNNs, while degree features can substantially boost performance on social graphs. They provide precomputed data splits and code to support transparent, reproducible comparisons and advocate for standardized evaluation practices in the graph learning community.
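The summary mentions two technical ingredients: node degree as an input feature and structure-agnostic baselines. The sketch below illustrates both under stated assumptions. It encodes degree as a one-hot vector clipped at a hypothetical `max_degree`, and it uses a plain sum readout that never reads the adjacency. The function names are hypothetical, and the paper's actual baselines, which feed pooled features to a small trainable classifier, may differ in detail.

```python
from typing import List, Sequence, Tuple

Edge = Tuple[int, int]  # undirected edge, listed once


def degree_one_hot(num_nodes: int, edges: Sequence[Edge],
                   max_degree: int) -> List[List[float]]:
    """One-hot node features from undirected degree, clipped at max_degree (assumed encoding)."""
    deg = [0] * num_nodes
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    feats = []
    for d in deg:
        row = [0.0] * (max_degree + 1)
        row[min(d, max_degree)] = 1.0  # degrees above max_degree share the last slot
        feats.append(row)
    return feats


def structure_agnostic_embedding(node_feats: Sequence[Sequence[float]]) -> List[float]:
    """Sum node features into one graph vector; edges are never consulted here."""
    dim = len(node_feats[0])
    return [sum(x[k] for x in node_feats) for k in range(dim)]


if __name__ == "__main__":
    # Triangle 0-1-2 plus a pendant node 3 attached to node 2.
    feats = degree_one_hot(4, [(0, 1), (1, 2), (2, 0), (2, 3)], max_degree=3)
    print(structure_agnostic_embedding(feats))  # a histogram of node degrees
```

With one-hot degree features, the sum readout reduces each graph to its degree histogram, so any topology the baseline "sees" enters only through the node features themselves. A downstream classifier (for example, a small MLP) would then be trained on these vectors.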

Abstract

Experimental reproducibility and replicability are critical topics in machine learning. Authors have often raised concerns about their absence from scientific publications, with the aim of improving the quality of the field. Recently, the graph representation learning field has attracted the attention of a wide research community, which has resulted in a large stream of works. As such, several Graph Neural Network models have been developed to effectively tackle graph classification. However, experimental procedures often lack rigor and are hardly reproducible. Motivated by this, we provide an overview of common practices that should be avoided to fairly compare with the state of the art. To counter this troubling trend, we ran more than 47,000 experiments in a controlled and uniform framework to re-evaluate five popular models across nine common benchmarks. Moreover, by comparing GNNs with structure-agnostic baselines, we provide convincing evidence that, on some datasets, structural information has not been exploited yet. We believe that this work can contribute to the development of the graph learning field by providing a much-needed grounding for rigorous evaluations of graph classification models.

Paper Structure

This paper contains 35 sections, 2 figures, 9 tables, and 2 algorithms.

Figures (2)

  • Figure 1: Chemical and social (with degree) benchmark results. For each benchmark, we report the validation and test accuracies of the evaluated models, together with published results when available.
  • Figure 2: We give a visual representation of the evaluation framework. We apply an external $k_{out}$-fold CV to estimate the generalization performance of a model, and we use a hold-out technique (bottom-left) to select the best hyper-parameters. For completeness, we show that it is also possible to apply an inner $k_{inn}$-fold CV (implementing a complete Nested Cross Validation), which multiplies the computational cost of model selection by a factor of $k_{inn}$.
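The evaluation scheme in Figure 2 (external $k_{out}$-fold CV for assessment, with a hold-out split inside each outer training fold for model selection) can be sketched as follows. This is a minimal illustration, not the authors' code. The names `assess`, `select_holdout`, `k_fold_indices`, and the `fit_and_score` callback are hypothetical, and `fit_and_score` stands in for "train a GNN with this configuration and return its accuracy". The folds are shuffled but not stratified, and details such as early stopping and repeated final training runs are omitted.

```python
import random
import statistics
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical callback: fit_and_score(config, train_idx, eval_idx) -> accuracy on eval_idx.
FitAndScore = Callable[[Dict, List[int], List[int]], float]


def k_fold_indices(n: int, k: int, rng: random.Random) -> List[List[int]]:
    """Shuffle 0..n-1 and deal the indices into k disjoint folds (no stratification)."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def select_holdout(train_idx: List[int], configs: Sequence[Dict],
                   fit_and_score: FitAndScore, val_frac: float,
                   rng: random.Random) -> Dict:
    """Model selection on a single hold-out split of the outer training fold."""
    shuffled = train_idx[:]
    rng.shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_frac))
    val, tr = shuffled[:n_val], shuffled[n_val:]
    # Highest validation score wins; max() keeps the earliest config on ties.
    return max(configs, key=lambda c: fit_and_score(c, tr, val))


def assess(n: int, configs: Sequence[Dict], fit_and_score: FitAndScore,
           k_out: int = 10, val_frac: float = 0.1,
           seed: int = 0) -> Tuple[float, float, List[Dict]]:
    """External k_out-fold CV; returns mean/std test accuracy and the chosen configs."""
    rng = random.Random(seed)
    folds = k_fold_indices(n, k_out, rng)
    scores, chosen = [], []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        best = select_holdout(train_idx, configs, fit_and_score, val_frac, rng)
        # The outer test fold is used exactly once, after selection is finished.
        scores.append(fit_and_score(best, train_idx, test_idx))
        chosen.append(best)
    return statistics.mean(scores), statistics.stdev(scores), chosen


# Toy stand-in for a GNN: 1-D points labelled by x > 0.5, "model" = fixed threshold.
_rng = random.Random(1)
XS = [_rng.random() for _ in range(200)]
YS = [int(x > 0.5) for x in XS]
CONFIGS = [{"threshold": 0.5}, {"threshold": 0.2}, {"threshold": 0.8}]


def toy_fit_and_score(config: Dict, train_idx: List[int], eval_idx: List[int]) -> float:
    # Training is a no-op for a fixed threshold rule; only the evaluation set matters.
    t = config["threshold"]
    hits = sum(int(XS[j] > t) == YS[j] for j in eval_idx)
    return hits / len(eval_idx)


if __name__ == "__main__":
    mean, std, chosen = assess(len(XS), CONFIGS, toy_fit_and_score, k_out=10)
    print(f"test accuracy: {mean:.3f} +/- {std:.3f}")
    print("selected per fold:", [c["threshold"] for c in chosen])
```

Replacing the single hold-out in `select_holdout` with an inner $k_{inn}$-fold loop would turn this into full Nested Cross Validation, at roughly $k_{inn}$ times the model-selection cost, as the caption notes.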