Online GNN Evaluation Under Test-time Graph Distribution Shifts

Xin Zheng, Dongjin Song, Qingsong Wen, Bo Du, Shirui Pan

TL;DR

The paper tackles online GNN evaluation, where test node labels are unavailable and the training graph cannot be accessed under test-time distribution shifts. It introduces LeBeD, a Learning Behavior Discrepancy score computed in three steps: test-graph inference, GNN re-training guided by node-prediction discrepancy and stopped by a structure-reconstruction-based, parameter-free optimality criterion, and score computation as LeBeD = ||θ_tr^* − θ_te^†||_2, the L2 distance between the fixed parameters of the well-trained GNN (θ_tr^*) and the parameters obtained by re-training on the test graph (θ_te^†). The method uses the discrepancies D_Pred and D_Stru as proxies for generalization performance and shows strong correlation with ground-truth test errors across diverse real-world graphs and GNN architectures, outperforming adapted CNN baselines. The result is a practical, label-free metric for reliable online GNN deployment under distribution shifts and privacy constraints, with implications for safer graph-based serving in industry. Key contributions include formalizing online GNN evaluation, proposing a parameter-free optimality criterion, and validating LeBeD’s effectiveness and efficiency across multiple datasets and shifts, under assumptions such as a fixed label space.
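The final scoring step, LeBeD = ||θ_tr^* − θ_te^†||_2, can be illustrated with a minimal sketch. It assumes both parameter sets are available as dictionaries mapping a layer name to a flattened list of weights; that layout, the function name `lebed_score`, and the toy values are hypothetical and not taken from the paper. The re-training that produces θ_te^† is not shown.

```python
import math
from typing import Dict, List

# Hypothetical layout: layer name -> flattened list of weights.
Params = Dict[str, List[float]]


def lebed_score(theta_tr: Params, theta_te: Params) -> float:
    """L2 distance between two parameter sets with matching layers and sizes."""
    if theta_tr.keys() != theta_te.keys():
        raise ValueError("parameter sets must share the same layer names")
    squared_sum = 0.0
    for name in theta_tr:
        a, b = theta_tr[name], theta_te[name]
        if len(a) != len(b):
            raise ValueError(f"size mismatch in layer {name!r}")
        # Accumulate squared differences across all layers, as if the
        # parameters were concatenated into one long vector.
        squared_sum += sum((x - y) ** 2 for x, y in zip(a, b))
    return math.sqrt(squared_sum)


# Toy values, made up purely for illustration.
theta_tr = {"conv1.weight": [0.5, -1.0], "conv1.bias": [0.0]}
theta_te = {"conv1.weight": [0.5, 2.0], "conv1.bias": [4.0]}
print(lebed_score(theta_tr, theta_te))  # 5.0
```

Under this reading, a larger distance means the re-trained model had to move further from the well-trained one to fit the test graph, which the score treats as a sign of higher test error.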

Abstract

Evaluating the performance of a well-trained GNN model on real-world graphs is a pivotal step for reliable GNN online deployment and serving. Due to a lack of test node labels and unknown potential training-test graph data distribution shifts, conventional model evaluation encounters limitations in calculating performance metrics (e.g., test error) and measuring graph data-level discrepancies, particularly when the training graph used for developing GNNs remains unobserved during test time. In this paper, we study a new research problem, online GNN evaluation, which aims to provide valuable insights into the well-trained GNNs' ability to effectively generalize to real-world unlabeled graphs under test-time graph distribution shifts. Concretely, we develop an effective learning behavior discrepancy score, dubbed LeBeD, to estimate the test-time generalization errors of well-trained GNN models. Through a novel GNN re-training strategy with a parameter-free optimality criterion, the proposed LeBeD comprehensively integrates learning behavior discrepancies from both node prediction and structure reconstruction perspectives. This enables the effective evaluation of the well-trained GNNs' ability to capture test node semantics and structural representations, making it an expressive metric for estimating the generalization error in online GNN evaluation. Extensive experiments on real-world test graphs under diverse graph distribution shifts verify the effectiveness of the proposed method, revealing its strong correlation with ground-truth test errors on various well-trained GNN models.
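To make "correlation with ground-truth test errors" concrete, the sketch below computes a rank correlation between per-graph scores and test errors. It assumes Spearman's rank correlation as the measure, since this excerpt does not name one, and the numbers are invented toy values rather than results from the paper.

```python
import math
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg_rank
        i = j + 1
    return out


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(ranks(x), ranks(y))


# Toy values, made up purely for illustration: one entry per test graph.
scores = [0.8, 1.5, 2.1, 3.0]          # hypothetical LeBeD scores
test_errors = [0.10, 0.18, 0.25, 0.40]  # hypothetical ground-truth errors
print(spearman(scores, test_errors))
```

A value close to 1 would indicate that ranking test graphs by score reproduces their ranking by true error, which is what makes a label-free score usable as an error proxy.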

Paper Structure (21 sections, 13 equations, 14 figures, 10 tables, 1 algorithm)

Figures (14)

  • Figure 1: Illustration of the proposed online GNN evaluation problem and our solution.
  • Figure 2: Overview of the proposed LeBeD score for online GNN evaluation under test-time distribution shifts, comprising three steps: S1, test graph online inference; S2, GNN re-training with the parameter-free optimality criterion; S3, LeBeD score computation. All $*$ superscripts indicate that the corresponding variables remain fixed during online GNN evaluation.
  • Figure 3: Running time comparison on the Citationv2 dataset w/ and w/o the proposed $D_{\text{stru.}}$-based criterion.
  • Figure 4: Hyper-parameter sensitivity analysis on $\epsilon$ in the proposed parameter-free optimality criterion. (left: Amazon-Photo dataset with fixed constant setting; right: DBLPv8 dataset with fixed ratio (%) setting.)
  • Figure 5: Correlation visualization comparison among ATC-MC of GCN on Cora, our LeBeD of GCN on Cora, Thres.($\tau=0.9$) of GAT on Amazon-Photo, and our LeBeD of GAT on Amazon-Photo.
  • ...and 9 more figures