Online GNN Evaluation Under Test-time Graph Distribution Shifts

Xin Zheng; Dongjin Song; Qingsong Wen; Bo Du; Shirui Pan

TL;DR

The paper tackles online GNN evaluation, where test labels are missing and the training graph is inaccessible under test-time distribution shifts. It introduces LeBeD, a Learning Behavior Discrepancy score computed in three steps: test-graph inference, GNN re-training on the test graph guided by node-prediction discrepancy, and a parameter-free stopping criterion based on structure reconstruction, culminating in $\text{LeBeD} = \|\theta_{tr}^{*} - \theta_{te}^{\dagger}\|_2$. The method uses D_Pred and D_Stru as proxies for generalization performance and shows strong correlations with ground-truth test errors across diverse real-world graphs and GNN architectures, outperforming baselines adapted from CNN evaluation. This provides a practical, label-free metric for reliable online GNN deployment under distribution shifts and privacy constraints, with implications for safer graph-based serving in industry. Key contributions include formalizing online GNN evaluation, proposing a parameter-free optimality criterion, and validating LeBeD's effectiveness and efficiency across multiple datasets and shifts, while acknowledging assumptions such as a fixed label space.
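The final step of the score, the L2 distance between the trained parameters and the re-trained parameters, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `lebed_score`, the dictionary layout (parameter name mapped to a flat list of floats), and the toy values are all assumptions made for illustration, and the re-training loop and stopping criterion are deliberately left out.

```python
import math


def lebed_score(theta_train, theta_retrained):
    """Return the L2 distance between two parameter sets.

    Both arguments are hypothetical containers: a dict mapping a
    parameter name to a flat list of floats.
    """
    if theta_train.keys() != theta_retrained.keys():
        raise ValueError("parameter sets must have the same names")

    squared = 0.0
    for name, w_train in theta_train.items():
        w_test = theta_retrained[name]
        if len(w_train) != len(w_test):
            raise ValueError(f"size mismatch for {name!r}")
        # Accumulate squared differences over every scalar parameter.
        squared += sum((a - b) ** 2 for a, b in zip(w_train, w_test))
    return math.sqrt(squared)


# Toy values (assumed, not taken from the paper).
theta_tr = {"layer1.weight": [1.0, 2.0], "layer1.bias": [0.0]}
theta_te = {"layer1.weight": [1.0, 5.0], "layer1.bias": [4.0]}
print(lebed_score(theta_tr, theta_te))  # sqrt(0 + 9 + 16) = 5.0
```

Under this reading, a larger distance means the re-trained model had to move further from the trained optimum, which the paper reports as correlating with test error.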

Abstract

Evaluating the performance of a well-trained GNN model on real-world graphs is a pivotal step for reliable GNN online deployment and serving. Due to a lack of test node labels and unknown potential training-test graph data distribution shifts, conventional model evaluation encounters limitations in calculating performance metrics (e.g., test error) and measuring graph data-level discrepancies, particularly when the training graph used for developing GNNs remains unobserved during test time. In this paper, we study a new research problem, online GNN evaluation, which aims to provide valuable insights into the well-trained GNNs' ability to effectively generalize to real-world unlabeled graphs under test-time graph distribution shifts. Concretely, we develop an effective learning behavior discrepancy score, dubbed LeBeD, to estimate the test-time generalization errors of well-trained GNN models. Through a novel GNN re-training strategy with a parameter-free optimality criterion, the proposed LeBeD comprehensively integrates learning behavior discrepancies from both node prediction and structure reconstruction perspectives. This enables the effective evaluation of the well-trained GNNs' ability to capture test node semantics and structural representations, making it an expressive metric for estimating the generalization error in online GNN evaluation. Extensive experiments on real-world test graphs under diverse graph distribution shifts verify the effectiveness of the proposed method, revealing its strong correlation with ground-truth test errors on various well-trained GNN models.

Paper Structure (21 sections, 13 equations, 14 figures, 10 tables, 1 algorithm)

Introduction
The Proposed Method
Preliminary
Problem Formulation
LeBeD: Learning Behavior Discrepancy Score
Experiments
Experimental Settings
Online GNN Model Evaluation Performance
In-depth Analysis of the Proposed LeBeD
Conclusion
Ethics Statement
Reproducibility Statement
Related Work
Test-time Dataset Details
In-depth Analysis and More Results
...and 6 more sections

Figures (14)

Figure 1: Illustration of the proposed online GNN evaluation problem and our solution.
Figure 2: The overview of the proposed LeBeD score for online GNN evaluation under test-time distribution shifts, including three steps, i.e., S1: Test graph online inference; S2: GNN re-training with parameter-free optimality criterion; S3: LeBeD score computation. All $*$ superscripts indicate that the corresponding variables remain fixed for online GNN evaluation.
Figure 3: Running time comparison on Citationv2 dataset w/ and w/o the proposed $D_{\text{stru.}}$ based criterion.
Figure 4: Hyper-parameter sensitivity analysis on $\epsilon$ in the proposed parameter-free optimality criterion. (left: Amazon-Photo dataset with fixed constant setting; right: DBLPv8 dataset with fixed ratio (%) setting.)
Figure 5: Correlation visualization comparison among ATC-MC of GCN on Cora, our LeBeD of GCN on Cora, Thres.($\tau=0.9$) of GAT on Amazon-Photo, and our LeBeD of GAT on Amazon-Photo.
...and 9 more figures
