Online GNN Evaluation Under Test-time Graph Distribution Shifts
Xin Zheng, Dongjin Song, Qingsong Wen, Bo Du, Shirui Pan
TL;DR
The paper tackles online GNN evaluation, where test labels are missing and the training graph is inaccessible under test-time distribution shifts. It introduces LeBeD, a Learning Behavior Discrepancy score computed in three steps: test-graph inference, GNN re-training guided by node-prediction discrepancy (D_Pred), and a parameter-free stopping criterion based on structure-reconstruction discrepancy (D_Stru). The score is the distance between the optimal training-time parameters and the re-trained test-time parameters, LeBeD = ||θ_tr^* − θ_te^†||_2. The method uses D_Pred and D_Stru as proxies for generalization performance and shows strong correlations with ground-truth test errors across diverse real-world graphs and GNN architectures, outperforming baselines adapted from CNN evaluation. The result is a practical, label-free metric for reliable online GNN deployment under distribution shifts and privacy constraints, with implications for safer graph-based serving in industry. Key contributions include formalizing online GNN evaluation, proposing a parameter-free optimality criterion, and validating LeBeD's effectiveness and efficiency across multiple datasets and shifts, while acknowledging assumptions such as a fixed label space.
Abstract
Evaluating the performance of a well-trained GNN model on real-world graphs is a pivotal step for reliable GNN online deployment and serving. Due to a lack of test node labels and unknown potential training-test graph data distribution shifts, conventional model evaluation encounters limitations in calculating performance metrics (e.g., test error) and measuring graph data-level discrepancies, particularly when the training graph used for developing GNNs remains unobserved during test time. In this paper, we study a new research problem, online GNN evaluation, which aims to provide valuable insights into well-trained GNNs' ability to generalize effectively to real-world unlabeled graphs under test-time graph distribution shifts. Concretely, we develop an effective learning behavior discrepancy score, dubbed LeBeD, to estimate the test-time generalization errors of well-trained GNN models. Through a novel GNN re-training strategy with a parameter-free optimality criterion, the proposed LeBeD comprehensively integrates learning behavior discrepancies from both node prediction and structure reconstruction perspectives. This enables effective evaluation of well-trained GNNs' ability to capture test node semantics and structural representations, making it an expressive metric for estimating the generalization error in online GNN evaluation. Extensive experiments on real-world test graphs under diverse graph distribution shifts verify the effectiveness of the proposed method, revealing its strong correlation with ground-truth test errors on various well-trained GNN models.
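
As a rough illustration of the final step only, the score LeBeD = ||θ_tr^* − θ_te^†||_2 is an L2 distance between two parameter sets of the same GNN architecture. The sketch below assumes the parameters are already available as flattened per-layer lists of floats; the name `lebed_score`, the `Params` type, and the toy layer names are hypothetical and not taken from the paper. It does not implement the re-training procedure, the node-prediction discrepancy, or the structure-reconstruction stopping criterion.

```python
import math
from typing import Dict, List

# Hypothetical container: layer name -> flattened weights of that layer.
Params = Dict[str, List[float]]


def lebed_score(theta_train: Params, theta_retrained: Params) -> float:
    """L2 distance between two parameter sets of the same architecture.

    theta_train stands in for the trained parameters (theta_tr^*);
    theta_retrained stands in for the parameters obtained after
    re-training on the test graph (theta_te^dagger).
    """
    if theta_train.keys() != theta_retrained.keys():
        raise ValueError("parameter sets must contain the same layers")
    squared = 0.0
    for name, w_tr in theta_train.items():
        w_te = theta_retrained[name]
        if len(w_tr) != len(w_te):
            raise ValueError(f"size mismatch in layer {name!r}")
        # Accumulate squared differences across all layers.
        squared += sum((a - b) ** 2 for a, b in zip(w_tr, w_te))
    return math.sqrt(squared)


# Toy values chosen for illustration; they are not results from the paper.
theta_tr = {"conv1": [1.0, 2.0], "conv2": [0.0]}
theta_te = {"conv1": [1.0, -1.0], "conv2": [4.0]}
score = lebed_score(theta_tr, theta_te)
print(score)
```

In this reading, a larger distance means the re-trained model had to move further from the trained parameters to fit the test graph, which the paper uses as a proxy for a larger test error.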
