Property-Driven Evaluation of GNN Expressiveness at Scale: Datasets, Framework, and Study

Sicong Che; Jiayi Yang; Sarfraz Khurshid; Wenxi Wang

TL;DR

This work proposes a general evaluation framework that assesses three key aspects of GNN expressiveness (generalizability, sensitivity, and robustness) using two novel quantitative metrics, and conducts the first comprehensive study of how global pooling methods affect GNN expressiveness.

Abstract

Advancing trustworthy AI requires principled software engineering approaches to model evaluation. Graph Neural Networks (GNNs) have achieved remarkable success in processing graph-structured data; however, their expressiveness in capturing fundamental graph properties remains an open challenge. We address this by developing a property-driven evaluation methodology grounded in formal specification, systematic evaluation, and empirical study. Leveraging Alloy, a software specification language and analyzer, we introduce a configurable graph dataset generator that produces two dataset families: GraphRandom, which contains diverse graphs that either satisfy or violate specific properties, and GraphPerturb, which introduces controlled structural variations. Together, these benchmarks encompass 336 new datasets, each with at least 10,000 labeled graphs, covering 16 fundamental graph properties critical to distributed systems, knowledge graphs, and biological networks. We propose a general evaluation framework that assesses three key aspects of GNN expressiveness (generalizability, sensitivity, and robustness) using two novel quantitative metrics. Using this framework, we conduct the first comprehensive study of the impact of global pooling methods on GNN expressiveness. Our findings reveal distinct trade-offs: attention-based pooling excels in generalization and robustness, while second-order pooling provides superior sensitivity, but no single approach performs consistently well across all properties. These insights highlight fundamental limitations and open research directions, including adaptive property-aware pooling, scale-sensitive architectures, and robustness-oriented training. By embedding software engineering rigor into AI evaluation, this work establishes a principled foundation for developing expressive and reliable GNN architectures.
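The two dataset families described in the abstract can be illustrated with a small sketch. This is not the authors' Alloy-based generator: it is plain Python, the property used (symmetry of the edge relation) is a stand-in chosen for illustration since the abstract does not list the 16 properties, and all function names are hypothetical. It shows the general idea only: label random graphs by whether they satisfy a property, then apply a single controlled edge change and observe whether the label flips.

```python
import random


def random_graph(n, p, rng):
    """Return an n x n adjacency matrix of a random directed graph.

    Each possible edge (including self-loops) is present with probability p.
    """
    return [[1 if rng.random() < p else 0 for _ in range(n)] for _ in range(n)]


def is_symmetric(adj):
    """Stand-in property (assumption): every edge i->j has a matching edge j->i."""
    n = len(adj)
    return all(adj[i][j] == adj[j][i] for i in range(n) for j in range(n))


def flip_edge(adj, i, j):
    """Return a copy of adj with edge (i, j) toggled: one controlled perturbation."""
    out = [row[:] for row in adj]
    out[i][j] = 1 - out[i][j]
    return out


def labeled_sample(n, p, seed):
    """Generate one (graph, label) pair in the spirit of a random labeled dataset."""
    rng = random.Random(seed)
    adj = random_graph(n, p, rng)
    return adj, is_symmetric(adj)


# A symmetric graph and a one-edge perturbation of it that violates the property.
base = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
perturbed = flip_edge(base, 0, 2)
```

Toggling a single off-diagonal edge of a symmetric graph always breaks symmetry, so `base` and `perturbed` form a pair that differs by one structural change but carries opposite labels. Pairs of this kind are what make a perturbation-style benchmark a probe of sensitivity rather than of overall accuracy.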

Paper Structure (17 sections, 6 equations, 6 figures, 4 tables, 1 algorithm)

Introduction
Background: Alloy
Related Work
Graph Dataset Generation using Alloy
GraphRandom Dataset Family
GraphPerturb Dataset Family
GNN Expressiveness Evaluation Framework
Training and Testing Sets
Three Key Aspects
Evaluation Metrics
Study: Global Pooling for GNN Expressiveness
Impact of Global Pooling
Global Pooling Comparisons
Global Pooling across Graph Sizes
Future Directions for Global Pooling
...and 2 more sections

Figures (6)

Figure 1: GraphRandom dataset family generation.
Figure 2: GraphPerturb dataset family generation; take graph size equal to 5 as an example.
Figure 3: Unified scores of GNNs with different pooling methods across three aspects on 16 properties.
Figure 4: Global pooling performance across ten graph sizes under generalizability aspect.
Figure 5: Global pooling performance across ten graph sizes under sensitivity aspect.
...and 1 more figure
