Temporal-Aware Evaluation and Learning for Temporal Graph Neural Networks

Junwei Su, Shan Wu

TL;DR

Temporal Graph Neural Networks (TGNNs) achieve strong performance on dynamic graphs, but existing evaluation metrics fail to capture temporal error structure, notably volatility clustering. The authors formalize metric expressiveness, prove the inadequacy of instance-based metrics, and introduce Volatility-Cluster Statistics (VCS) to detect temporal error clustering, along with Volatility-Cluster-Aware (VCA) learning to regularize models toward more uniform error distributions. They validate the approach across five datasets and six state-of-the-art TGNNs, showing that volatility patterns vary by architecture and that VCA reduces volatility clusters with manageable impact on predictive accuracy. This work enables temporally robust evaluation and training of TGNNs, with practical implications for fault-tolerant and real-time systems that depend on stable error dynamics.

Abstract

Temporal Graph Neural Networks (TGNNs) are a family of graph neural networks designed to model and learn dynamic information from temporal graphs. Given their substantial empirical success, there is an escalating interest in TGNNs within the research community. However, the majority of these efforts have been channelled towards algorithm and system design, with the evaluation metrics receiving comparatively less attention. Effective evaluation metrics are crucial for providing detailed performance insights, particularly in the temporal domain. This paper investigates the commonly used evaluation metrics for TGNNs and illustrates the failure mechanisms of these metrics in capturing essential temporal structures in the predictive behaviour of TGNNs. We provide a mathematical formulation of existing performance metrics and utilize an instance-based study to underscore their inadequacies in identifying volatility clustering (the occurrence of emerging errors within a brief interval). This phenomenon has profound implications for both algorithm and system design in the temporal domain. To address this deficiency, we introduce a new volatility-aware evaluation metric (termed volatility cluster statistics), designed for a more refined analysis of model temporal performance. Additionally, we demonstrate how this metric can serve as a temporal-volatility-aware training objective to alleviate the clustering of temporal errors. Through comprehensive experiments on various TGNN models, we validate our analysis and the proposed approach. The empirical results offer revealing insights: 1) existing TGNNs are prone to making errors with volatility clustering, and 2) TGNNs with different mechanisms to capture temporal information exhibit distinct volatility clustering patterns. Our empirical findings demonstrate that our proposed training objective effectively reduces volatility clusters in error.

Paper Structure

This paper contains 34 sections, 1 theorem, 20 equations, 5 figures, 2 tables, 3 algorithms.

Key Result

Theorem 3.1

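Let $\widehat{\mathbf{Y}}_1$ and $\widehat{\mathbf{Y}}_2$ be two distinct predictions for the set $\mathcal{E}$ with ground truth $\mathbf{Y}$, and let $\mu(\cdot)$ be an instance-based evaluation metric. Then $\mu(\mathbf{Y}, \widehat{\mathbf{Y}}_1) = \mu(\mathbf{Y}, \widehat{\mathbf{Y}}_2)$ so long as $\mathrm{H}(\mathbf{Y},\widehat{\mathbf{Y}}_1) = \mathrm{H}(\mathbf{Y},\widehat{\mathbf{Y}}_2)$, where $\mathrm{H}(\mathbf{Y},\widehat{\mathbf{Y}}) = \sum_{k=1}^{|\mathcal{E}|} \mathds{1}[y_k \neq \widehat{y}_k]$ is the Hamming distance between a prediction and the ground truth.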
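
To make the intuition behind Theorem 3.1 concrete, the sketch below compares two predictions that have the same Hamming distance $\mathrm{H}$ to the ground truth but very different temporal error patterns. It is an illustration only: plain accuracy is used as a stand-in for an instance-based metric (an assumption, not the paper's formal definition), and the helper `errors_per_window` is a hypothetical diagnostic, not the paper's volatility cluster statistics.

```python
import numpy as np


def hamming(y_true, y_pred):
    """H(Y, Y_hat): number of positions where prediction and label differ."""
    return int(np.sum(y_true != y_pred))


def accuracy(y_true, y_pred):
    """Plain accuracy, used here as an example of an instance-based metric."""
    return float(np.mean(y_true == y_pred))


def errors_per_window(y_true, y_pred, window):
    """Hypothetical helper: error counts in consecutive, equal-sized windows.

    Assumes events are already in chronological order and that the number
    of events is a multiple of `window`.
    """
    errors = (y_true != y_pred).astype(int)
    return errors.reshape(-1, window).sum(axis=1)


# Twenty chronologically ordered events, all with ground-truth label 1.
n = 20
y = np.ones(n, dtype=int)

# Prediction 1: four errors spread evenly over time.
y_hat_spread = y.copy()
y_hat_spread[[2, 7, 12, 17]] = 0

# Prediction 2: four errors clustered at the end of the horizon.
y_hat_clustered = y.copy()
y_hat_clustered[[16, 17, 18, 19]] = 0

print(hamming(y, y_hat_spread), hamming(y, y_hat_clustered))
print(accuracy(y, y_hat_spread), accuracy(y, y_hat_clustered))
print(errors_per_window(y, y_hat_spread, window=5))
print(errors_per_window(y, y_hat_clustered, window=5))
```

Both predictions have four errors and the same accuracy, so the accuracy score cannot tell them apart. The windowed counts differ: the first prediction has one error in each window, while the second has all four in the last window.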

Figures (5)

  • Figure 1: The Learning Procedure of TGNNs. Fig. \ref{fig:tgnn_pipe} depicts the learning procedure of a TGNN. Data/events are split in chronological order into training and testing/validation sets. During training, data/events are further divided into temporal batches. The incoming batch serves as training samples for updating the model and the embedding for the subsequent batch. Fig. \ref{fig:comp} visualizes the training procedure and computation of TGNNs. Incoming events serve as positive samples, and negative events are sampled from the rest of the graph.
  • Figure 2: An illustration of different error patterns. Fig. \ref{fig:random_err} shows the random error pattern, where wrong predictions are randomly distributed across the time interval. Fig. \ref{fig:cluster_err} shows the volatility cluster error pattern, where wrong predictions are clustered in a small time interval (the end of the temporal horizon in the example). Fig. \ref{fig:cluster_err} shows the regular error pattern, where wrong predictions are evenly spaced. The shaded area in the plots indicates the accumulated count of errors.
  • Figure 3: An illustration of the error patterns across different types of TGNNs. The x-axis represents the time during the test period, and the color density indicates the error density (number of errors per time unit). A higher density (redder) indicates more errors. As shown in the figures, memory-based TGNNs exhibit a higher error density toward the end of the testing period, while RNN-based TGNNs display a higher error density at the beginning of the testing period. Attention-based TGNNs, on the other hand, demonstrate a more uniform error distribution.
  • Figure 4: An illustration of the effects of the hyper-parameters $\tau$ and $\gamma$ on VCS and VCA. Figs. \ref{fig:gamma_vcs} and \ref{fig:gamma_ap} demonstrate that as $\gamma$ increases, VCS performance improves while AP decreases. Hence, $\gamma$ serves as a control variable that manages the trade-off between VCS and AP. Fig. \ref{fig:tau_var} shows that increasing $\tau$ reduces the variance in the measure, but the marginal gain diminishes after $\tau = 5$.
  • Figure 5: Illustration of the TGNN Training Procedure. The figure depicts the training flow of a TGNN for two epochs. The incoming batch serves as training samples for updating the model and for updating the embedding for the subsequent batch. The model parameters are carried over into the second epoch.
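
The data flow described in the captions of Figures 1 and 5 (chronological split, temporal batches, negative sampling) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the 70/30 split, the batch size, and the uniform negative sampler are all hypothetical choices, and the model update itself is left as a comment.

```python
import numpy as np


def chronological_split(timestamps, train_frac=0.7):
    """Split event indices by time: earliest events train, latest events test."""
    order = np.argsort(timestamps, kind="stable")
    cut = int(len(order) * train_frac)
    return order[:cut], order[cut:]


def temporal_batches(indices, batch_size):
    """Yield consecutive batches of time-ordered event indices."""
    for start in range(0, len(indices), batch_size):
        yield indices[start:start + batch_size]


def sample_negative_destinations(rng, num_nodes, size):
    """Assumed negative sampler: uniform random destination nodes.

    A sampled node may coincide with a true destination; a real pipeline
    may filter such collisions.
    """
    return rng.integers(0, num_nodes, size=size)


rng = np.random.default_rng(0)
num_nodes = 10

# Toy event stream: (source, destination, timestamp) stored as parallel arrays.
timestamps = np.array([5.0, 1.0, 3.0, 9.0, 7.0, 2.0, 8.0, 4.0, 6.0, 10.0])
sources = rng.integers(0, num_nodes, size=len(timestamps))
destinations = rng.integers(0, num_nodes, size=len(timestamps))

train_idx, test_idx = chronological_split(timestamps, train_frac=0.7)
batches = list(temporal_batches(train_idx, batch_size=3))

for batch in batches:
    pos_src, pos_dst = sources[batch], destinations[batch]
    neg_dst = sample_negative_destinations(rng, num_nodes, len(batch))
    # A model update on (pos_src, pos_dst) versus (pos_src, neg_dst) would go
    # here, followed by refreshing the embeddings used by the next batch.
```

Because the split is by timestamp rather than at random, every training event precedes every test event, which is what allows errors to be analysed along the time axis of the test period.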

Theorems & Definitions (3)

  • Definition 1: Expressiveness of Evaluation Metric
  • Definition 2: Instance-based Evaluation
  • Theorem 3.1: Failure of Instance-Based Evaluation