An Empirical Study of Realized GNN Expressiveness

Yanbo Wang; Muhan Zhang

TL;DR

This paper introduces BREC, a large, diverse dataset for testing GNN expressiveness beyond the 1-WL bound, with graphs that are up to 4-WL-indistinguishable. It addresses the limitations of prior datasets in difficulty, granularity, and scale. BREC is paired with RPC, a robust evaluation framework that uses a Siamese GNN and Hotelling's T-squared tests to quantify real-world discriminative power while accounting for numerical fluctuations. Experiments on 23 models show that realized expressiveness largely tracks theoretical expectations but also reveal notable gaps, with distance encoding and the choice of subgraph radius being crucial for performance. The work provides practical tools and insights to guide the development of more expressive GNN architectures, and the dataset and code are released publicly to facilitate reproducible benchmarking.

Abstract

Research on the theoretical expressiveness of Graph Neural Networks (GNNs) has developed rapidly, and many methods have been proposed to enhance it. However, apart from the few methods that strictly follow the $k$-dimensional Weisfeiler-Lehman ($k$-WL) test hierarchy, most lack a uniform expressiveness measure, which makes it difficult to compare their expressiveness quantitatively. Previous research has attempted to use datasets for measurement, but these datasets face problems with difficulty (any model surpassing 1-WL has nearly 100% accuracy), granularity (models tend to be either 100% correct or near random guess), and scale (only several essentially different graphs are involved). To address these limitations, we study the realized expressive power that a practical model instance can achieve using a novel expressiveness dataset, BREC, which offers greater difficulty (with up to 4-WL-indistinguishable graphs), finer granularity (enabling comparison of models between 1-WL and 3-WL), and a larger scale (800 1-WL-indistinguishable graphs that are non-isomorphic to each other). We synthetically test 23 models with higher-than-1-WL expressiveness on BREC. Our experiment gives the first thorough measurement of the realized expressiveness of those state-of-the-art beyond-1-WL GNN models and reveals the gap between theoretical and realized expressiveness. The dataset and evaluation code are released at: https://github.com/GraphPKU/BREC.
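To make "1-WL-indistinguishable" concrete, the following is a minimal sketch of 1-WL color refinement on toy graphs. The function name `wl_histograms` and the three example graphs are illustrative assumptions and are not taken from BREC or its code; the sketch only shows the textbook behaviour that a 6-cycle and two disjoint triangles, although non-isomorphic, receive identical color histograms.

```python
from collections import Counter


def wl_histograms(graphs, rounds):
    """Run 1-WL color refinement jointly on several graphs.

    Each graph is an adjacency dict {node: [neighbors]}. A palette shared
    by all graphs keeps color ids comparable across them.
    """
    colors = [{v: 0 for v in g} for g in graphs]  # uniform initial color
    for _ in range(rounds):
        # A node's signature: its own color plus the sorted neighbor colors.
        sigs = [
            {v: (c[v], tuple(sorted(c[u] for u in g[v]))) for v in g}
            for g, c in zip(graphs, colors)
        ]
        unique = sorted({s for d in sigs for s in d.values()})
        palette = {s: i for i, s in enumerate(unique)}
        colors = [{v: palette[s] for v, s in d.items()} for d in sigs]
    return [Counter(c.values()) for c in colors]


# Hypothetical toy graphs: a 6-cycle, two disjoint triangles, a 6-node path.
hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

hex_hist, tri_hist, path_hist = wl_histograms([hexagon, triangles, path], rounds=6)
print(hex_hist == tri_hist)   # both graphs are 2-regular, so 1-WL cannot separate them
print(hex_hist == path_hist)  # the path has degree-1 endpoints, so 1-WL separates it
```

Identical histograms mean only that 1-WL fails to separate the two graphs, not that they are isomorphic. Pairs of this kind are what a beyond-1-WL model is expected to tell apart.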

Paper Structure (26 sections, 1 theorem, 14 equations, 6 figures, 11 tables)

Introduction
Limitations of Existing Datasets
BREC: A New Dataset for Expressiveness
Dataset Composition
Advantages
RPC: A New Evaluation Diagram
Training Framework
Evaluation Method
Experiment
Conclusion and Future Work
Details on Regular Graphs
Node Features
WL Algorithm
Circulant Skip Links (CSL) Graphs
GNN Extensions
...and 11 more sections

Key Result

Theorem 9.1

The false positive rate with adaptive confidence interval is $\frac{1}{2^{2P}}$.

Figures (6)

Figure 1: Sample graphs in previous datasets
Figure 2: BREC dataset samples
Figure 3: Evaluation Method. (a) illustrates the training framework, where two non-isomorphic graphs are input into a Siamese network architecture to increase the distance between their respective representations. (b) presents the RPC (Reliability and Performance Characterization) pipeline, which comprises two main components: the Major Procedure and the Reliability Check. The Major Procedure operates on non-isomorphic pairs to quantify the external differences, whereas the Reliability Check is performed on isomorphic pairs to assess internal fluctuations. We calculate the corresponding $T^2$-statistics and compare them with a predefined threshold. A pair is considered successfully distinguished only if it passes both tests.
Figure 4: Regular graphs relationship
Figure 5: BREC Statistics
...and 1 more figures
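The two-test decision described in the Figure 3 caption can be sketched as follows. This is an illustration under stated assumptions, not the released BREC evaluation code: it uses the textbook paired-sample Hotelling's $T^2$ statistic, $T^2 = q\,\bar{d}^{\top} S^{-1} \bar{d}$, and the simulated embeddings, the sample count `q`, the small ridge term, and the threshold value are all hypothetical choices made for the example.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def paired_t2(xs, ys, ridge=1e-9):
    """Paired-sample Hotelling's T^2 = q * dbar^T S^{-1} dbar."""
    q, dim = len(xs), len(xs[0])
    diffs = [[x - y for x, y in zip(a, b)] for a, b in zip(xs, ys)]
    mean = [sum(d[j] for d in diffs) / q for j in range(dim)]
    # Sample covariance of the differences; the ridge (an assumption)
    # guards against a singular matrix.
    cov = [
        [
            sum((d[i] - mean[i]) * (d[j] - mean[j]) for d in diffs) / (q - 1)
            + (ridge if i == j else 0.0)
            for j in range(dim)
        ]
        for i in range(dim)
    ]
    weights = solve(cov, mean)
    return q * sum(m * w for m, w in zip(mean, weights))


rng = random.Random(0)


def embed(center, q, noise=0.01):
    """Simulated embeddings: a fixed center plus small numerical noise."""
    return [[c + rng.gauss(0.0, noise) for c in center] for _ in range(q)]


q = 32             # hypothetical number of copies per graph
threshold = 100.0  # hypothetical threshold, not a value from the paper

# Major procedure: embeddings of two non-isomorphic graphs.
t2_major = paired_t2(embed([0.0, 1.0], q), embed([1.0, 0.0], q))
# Reliability check: two sets of embeddings of the same graph.
t2_reliability = paired_t2(embed([0.0, 1.0], q), embed([0.0, 1.0], q))

distinguished = t2_major > threshold and t2_reliability < threshold
print(distinguished)
```

Requiring both conditions reflects the caption's rule: a large statistic on the non-isomorphic pair counts only if the statistic on the isomorphic pair stays small, so numerical fluctuation alone does not register as a success.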

Theorems & Definitions (2)

Theorem 9.1
proof
