Wiki-CS: A Wikipedia-Based Benchmark for Graph Neural Networks

Péter Mernyei, Cătălina Cangea

TL;DR

The paper introduces Wiki-CS, a Wikipedia-derived graph dataset for benchmarking GNNs. Nodes are Computer Science articles, edges are hyperlinks between them, and 10 classes represent branches of the field. The dataset addresses limitations of existing benchmarks by offering higher connectivity and more varied neighborhood structures than classic citation networks. The authors provide a full dataset construction protocol, 300-dimensional GloVe-based node features, multiple training splits, and experiments on semi-supervised node classification and single-relation link prediction using standard GNNs and baselines. The methods perform competitively on both tasks, supporting the generality of GNN approaches, and the authors release code and data for reproducibility.

Abstract

We present Wiki-CS, a novel dataset derived from Wikipedia for benchmarking Graph Neural Networks. The dataset consists of nodes corresponding to Computer Science articles, with edges based on hyperlinks and 10 classes representing different branches of the field. We use the dataset to evaluate semi-supervised node classification and single-relation link prediction models. Our experiments show that these methods perform well on a new domain, with structural properties different from earlier benchmarks. The dataset is publicly available, along with the implementation of the data pipeline and the benchmark experiments, at https://github.com/pmernyei/wiki-cs-dataset .

Paper Structure

This paper contains 17 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: A subgraph of the subcategory relation graph. Nodes with dark borders are the prominent categories chosen based on centrality. The others were aggregated to the nearest marked ancestor as denoted by their colors, with ties broken arbitrarily.
  • Figure 2: Distribution of the ratio of neighbors belonging to the same class. In all three citation network datasets, almost two-thirds of all nodes have all neighbors belonging to the same class. The distribution of Wiki-CS is considerably more balanced.
  • Figure 3: Deep Graph Mapper (DGM) visualisation of benchmarks. Each node in the figure corresponds to a cluster of similar nodes in the original graph, with edge thickness representing the number of connections between clusters. Colors represent the most frequent class in each cluster. The DGM unsupervised embedding process did not take labels into account, relying only on the node features and edges. The hyperparameters are described in the hyperparameters appendix.
  • Figure 4: Deep Graph Mapper visualisation of the predictions of different node classification models. The top image colors each cluster according to its most frequent true label, similar to the Wiki-CS visualisation in Figure 3. The other plots have clusters colored according to the most frequent prediction of the corresponding model. Note that this can hide differences that do not change the majority prediction in a cluster. The specific parameters used are described in the hyperparameters appendix.
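
The quantity plotted in Figure 2, the per-node ratio of neighbors sharing the node's class, can be illustrated with a minimal sketch. This is not the authors' code: the function name, the toy graph, and the choice to treat edges as undirected and skip self-loops are assumptions made only for illustration.

```python
from collections import defaultdict


def same_class_neighbor_ratio(edges, labels):
    """Return {node: fraction of its neighbors that share its label}.

    edges:  iterable of (u, v) pairs, treated as undirected (an assumption).
    labels: dict mapping node -> class label.
    Nodes without neighbors are omitted, since the ratio is undefined for them.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops (an assumption)
        neighbors[u].add(v)
        neighbors[v].add(u)
    return {
        node: sum(labels[n] == labels[node] for n in nbrs) / len(nbrs)
        for node, nbrs in neighbors.items()
    }


# Toy graph with hypothetical data, not taken from Wiki-CS.
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
toy_labels = {0: "a", 1: "a", 2: "b", 3: "b"}
ratios = same_class_neighbor_ratio(toy_edges, toy_labels)
```

A histogram of these per-node values gives the kind of distribution the caption describes: a ratio of 1.0 means every neighbor belongs to the node's own class, which the caption reports for almost two-thirds of nodes in the citation network datasets.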