Reproducibility and Geometric Intrinsic Dimensionality: An Investigation on Graph Neural Network Research

Tobias Hille, Maximilian Stubbemann, Tom Hanika

TL;DR

The paper tackles reproducibility in graph neural network research by introducing a formal reproducibility ontology and applying it to six reproduced GNN results. It also investigates how the geometric intrinsic dimensionality (ID) of the data influences model performance, proposing ID-based feature selection to probe robustness. Key findings show that reproducibility varies across data, software, and results, with documentation and dependencies as the primary bottlenecks, while ID can meaningfully alter performance in a method-dependent way. The work provides practical recommendations for reproducible ML practices and highlights the need to consider data geometry when comparing GNNs in high-dimensional settings.

Abstract

Difficulties in the replication and reproducibility of empirical evidence in machine learning research have become a prominent topic in recent years. Ensuring that machine learning research results are sound and reliable requires reproducibility, which verifies the reliability of research findings using the same code and data. This promotes open and accessible research, robust experimental workflows, and the rapid integration of new findings. Evaluating the degree to which research publications support these different aspects of reproducibility is one goal of the present work. For this we introduce an ontology of reproducibility in machine learning and apply it to methods for graph neural networks. Building on these efforts, we turn towards another critical challenge in machine learning, namely the curse of dimensionality, which poses challenges in data collection, representation, and analysis, making it harder to find representative data and impeding the training and inference processes. Using the closely linked concept of geometric intrinsic dimension, we investigate to what extent the machine learning models used are influenced by the intrinsic dimension of the data sets they are trained on.

Paper Structure

This paper contains 41 sections, 14 equations, 9 figures, and 6 tables.

Figures (9)

  • Figure 1: Top levels of the reproducibility ontology.
  • Figure 2: Overview of the distribution of the selected papers.
  • Figure 3: Influence of Intrinsic Dimension measured through feature selection for the GCN results.
  • Figure 4: Influence of Intrinsic Dimension measured through feature selection for the SAGN+SLE results.
  • Figure 5: Overview of the evaluation scores for each combination of data set and paper, plotted against the remaining intrinsic dimensionality. The x-axis shows the sum of the (approximated) normalized intrinsic dimensionality of the remaining features, divided by the total sum over the whole feature set. The y-axis shows the evaluation score obtained by the method trained on the data set with the corresponding feature selection.
  • ...and 4 more figures
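
The x-axis quantity described in the Figure 5 caption can be illustrated with a minimal sketch. It assumes that a per-feature normalized intrinsic dimensionality score is already available; how the paper approximates those scores is not shown in this summary. The function name `remaining_id_fraction` and both of its parameters are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def remaining_id_fraction(scores: Sequence[float], keep: Sequence[bool]) -> float:
    """Share of the total per-feature ID score carried by the kept features.

    scores: hypothetical per-feature normalized ID scores (assumed precomputed).
    keep:   one flag per feature, True if the feature survives the selection.
    """
    if len(scores) != len(keep):
        raise ValueError("scores and keep must have the same length")
    total = sum(scores)
    if total == 0:
        raise ValueError("total score must be non-zero")
    # Sum the scores of the retained features only.
    kept = sum(s for s, k in zip(scores, keep) if k)
    # Normalize by the sum over the whole feature set.
    return kept / total


# Usage: three features, only the first one is kept.
fraction = remaining_id_fraction([2.0, 1.0, 1.0], [True, False, False])
```

Under these assumptions, keeping every feature yields 1.0, and discarding features moves the value towards 0, which matches the caption's description of the axis as a normalized sum.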

Theorems & Definitions (1)

  • Definition 1