Wiki-CS: A Wikipedia-Based Benchmark for Graph Neural Networks
Péter Mernyei, Cătălina Cangea
TL;DR
The paper introduces Wiki-CS, a Wikipedia-derived graph dataset for benchmarking GNNs. Nodes are Computer Science articles, edges are hyperlinks between them, and the 10 classes correspond to branches of the field. The dataset addresses limitations of existing benchmarks by offering higher connectivity and more varied neighborhood structures than classic citation networks. The authors describe the full dataset construction protocol, provide 300-dimensional GloVe-based node features and multiple training splits, and run experiments on semi-supervised node classification and single-relation link prediction using standard GNNs and baselines. The methods perform well on this new domain, which supports the generality of GNN approaches, and the authors release code and data for reproducibility.
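To make the shape of such a dataset concrete, here is a minimal sketch of a graph with per-node feature vectors, class labels, hyperlink edges, and several training masks. It is an illustration only, not the authors' pipeline: the `ToyGraph` container, the `average_embedding` helper, the 2-dimensional toy vectors (standing in for 300-dimensional GloVe vectors), and the assumption that an article's feature vector is the average of its word vectors are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ToyGraph:
    """Hypothetical container mirroring the pieces the summary lists."""
    features: List[List[float]]    # one feature vector per node (article)
    labels: List[int]              # one class index per node
    edges: List[Tuple[int, int]]   # directed hyperlinks as (source, target)
    train_masks: List[List[bool]]  # one boolean mask per training split


def average_embedding(tokens: List[str],
                      vectors: Dict[str, List[float]],
                      dim: int) -> List[float]:
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def mean_degree(graph: ToyGraph) -> float:
    """Average number of edge endpoints per node (each edge touches two nodes)."""
    return 2 * len(graph.edges) / len(graph.labels)


# Toy 2-d word vectors; a real setup would load pretrained 300-d vectors.
toy_vectors = {
    "graph": [1.0, 0.0],
    "neural": [0.0, 1.0],
    "compiler": [1.0, 1.0],
}
toy_articles = [["graph", "neural"], ["compiler"], ["unknown"]]

graph = ToyGraph(
    features=[average_embedding(a, toy_vectors, 2) for a in toy_articles],
    labels=[0, 1, 1],
    edges=[(0, 1), (1, 2), (2, 0)],
    # Two alternative training splits over the same three nodes.
    train_masks=[[True, False, False], [False, True, False]],
)

print(graph.features)
print(mean_degree(graph))
```

Keeping several masks next to a single feature matrix lets the same model be trained once per split and the scores averaged, which reduces dependence on one particular choice of labeled nodes.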
Abstract
We present Wiki-CS, a novel dataset derived from Wikipedia for benchmarking Graph Neural Networks. The dataset consists of nodes corresponding to Computer Science articles, with edges based on hyperlinks and 10 classes representing different branches of the field. We use the dataset to evaluate semi-supervised node classification and single-relation link prediction models. Our experiments show that these methods perform well on a new domain, with structural properties different from earlier benchmarks. The dataset is publicly available, along with the implementation of the data pipeline and the benchmark experiments, at https://github.com/pmernyei/wiki-cs-dataset .
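For readers unfamiliar with link prediction, the sketch below shows one common way to prepare data for it: hold out a fraction of the existing edges as positive test examples and sample node pairs with no edge between them as negatives. This is a generic illustration under stated assumptions, not the protocol used in the paper. The function names `split_edges` and `sample_negatives`, the 50% held-out fraction, and the toy edge list are all hypothetical.

```python
import random
from typing import List, Tuple

Edge = Tuple[int, int]


def split_edges(edges: List[Edge], test_fraction: float,
                seed: int = 0) -> Tuple[List[Edge], List[Edge]]:
    """Shuffle the edges and hold out a fraction of them for testing."""
    rng = random.Random(seed)
    shuffled = edges[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def sample_negatives(num_nodes: int, edges: List[Edge], count: int,
                     seed: int = 0) -> List[Edge]:
    """Sample distinct directed node pairs that are not existing edges."""
    existing = set(edges)
    max_possible = num_nodes * (num_nodes - 1) - len(existing)
    if count > max_possible:
        raise ValueError("not enough non-edges to sample from")
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < count:
        u = rng.randrange(num_nodes)
        v = rng.randrange(num_nodes)
        # Skip self-loops and pairs that are already linked.
        if u != v and (u, v) not in existing:
            negatives.add((u, v))
    return sorted(negatives)


# Toy directed graph on 5 nodes.
toy_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2)]
train_edges, test_edges = split_edges(toy_edges, test_fraction=0.5)
negatives = sample_negatives(5, toy_edges, count=len(test_edges))

print(len(train_edges), len(test_edges), len(negatives))
```

A model would then be trained on `train_edges` only and asked to rank the pairs in `test_edges` above those in `negatives`.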
