What Do Temporal Graph Learning Models Learn?

Abigail J. Hayes, Tobias Schumacher, Markus Strohmaier

TL;DR

The paper addresses the reliability and interpretability of temporal-graph benchmarks by asking which of eight fundamental properties of temporal graphs the models actually learn. It introduces a property-based evaluation framework and systematically tests seven models on synthetic and real-world datasets, revealing a mixed picture: models reliably learn some mechanisms, such as preferential attachment, but struggle with edge direction, density, and recency, and only a subset capture persistence or periodicity. The findings highlight fundamental limitations in current temporal graph learners and motivate interpretability-driven evaluations and targeted model improvements. Practically, the work guides practitioners in selecting and calibrating models for tasks where specific temporal properties matter, and it suggests directions for developing models that better capture neglected dynamics.

Abstract

Learning on temporal graphs has become a central topic in graph representation learning, with numerous benchmarks indicating the strong performance of state-of-the-art models. However, recent work has raised concerns about the reliability of benchmark results, noting issues with commonly used evaluation protocols and the surprising competitiveness of simple heuristics. This contrast raises the question of which properties of the underlying graphs temporal graph learning models actually use to form their predictions. We address this by systematically evaluating seven models on their ability to capture eight fundamental attributes related to the link structure of temporal graphs. These include structural characteristics such as density, temporal patterns such as recency, and edge formation mechanisms such as homophily. Using both synthetic and real-world datasets, we analyze how well models learn these attributes. Our findings reveal a mixed picture: models capture some attributes well but fail to reproduce others. With this, we expose important limitations. Overall, we believe that our results provide practical insights for the application of temporal graph learning models, and motivate more interpretability-driven evaluations in temporal graph learning research.

Paper Structure

This paper contains 20 sections, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Direction: ability of graph learning models to distinguish directions of edges. For each positive edge $(u,v)$ in the UCI test data, we take the probability predicted by the graph learning models and compute its absolute distance to the predicted probability for the non-existing reverse edge $(v,u)$. Panel (a) depicts the cumulative distribution of these distances when training with the original training edges from the data; panel (b) shows them when, during training, both positive and negative edges are provided in both directions. We observe in (a) that, for most models, the predicted probabilities are nearly symmetric on roughly 50% of all edges, with a difference smaller than 0.02. This indicates strong limitations in the ability of models to distinguish directions of edges (✗). Similar results for the Enron dataset are in the appendix. Training on bidirectional edges further increases symmetry in predictions, with 90% of all edges having a difference of less than 0.01 for many models, and only DyGFormer still yielding high differences.
  • Figure 2: Density: ability of models to replicate true density of networks. We trained on the same set of positive training edges, but varied the negative sampling ratio. We depict the density resulting from predicting on all potential edges. Predicted density is generally much lower than the density seen during training. True density also appears hard to approximate, as models seem prone to predicting no edges when seeing larger numbers of negative edges. Thus, models do not appear to learn density (✗).
  • Figure 3: Persistence: ability of models to learn persistent graphs. We trained the temporal graph models on fixed snapshots from the UCI dataset, which were repeated throughout training, and depict the average probability scores resulting from each model when predicting positive and negative edges of these snapshots. Only TGAT and DyGFormer appear to reproduce fixed graphs with reasonable confidence (✓).
  • Figure 4: Periodicity: ability of models to learn periodically repeated edges. We selected pairs of consecutive snapshots from the UCI dataset, and tested whether the temporal graph learning models could reproduce a consistent pattern of two oscillating snapshots. We depict average predicted probabilities when testing at even (left) and odd (right) timestamps; colors correspond to predictions on edges present at odd, even, both, or neither timesteps. Only GraphMixer and TCL appear to properly reproduce the training pattern (✓).
  • Figure 5: Recency: impact of the time at which an edge was last seen on its probability score at test time. For 10 timesteps, we sampled a random set of positive edges. These edge sets are disjoint over all timesteps, and reflect the density at representative timesteps in the original corresponding dataset. We show average predicted probability scores at timestep $t=11$ for all positive edges seen during training, separated by the timestep in which they were seen. Overall, we observe no consistent trend regarding whether edges seen more recently (or earlier) have higher probability scores (✗). Instead, all edges appear to have very similar probability scores on average, with the exception of TGAT on the graphs relating to the Wikipedia dataset.
  • ...and 6 more figures
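
The direction and density checks described in the captions of Figures 1 and 2 can be sketched in a few lines. The following is only an illustration under stated assumptions, not the authors' code: `score` is a hypothetical callable standing in for any trained model's predicted edge probability, `toy_probs` is made-up data, the 0.5 decision cutoff used for density is an assumption (the captions do not state one), and density is computed over ordered node pairs without self-loops.

```python
from typing import Callable, Iterable, List, Tuple

Edge = Tuple[int, int]
# Hypothetical interface: returns the predicted probability of edge u -> v.
ScoreFn = Callable[[int, int], float]


def direction_gaps(score: ScoreFn, positive_edges: Iterable[Edge]) -> List[float]:
    """Absolute gap |p(u,v) - p(v,u)| for each positive test edge (Figure 1)."""
    return [abs(score(u, v) - score(v, u)) for u, v in positive_edges]


def fraction_below(gaps: List[float], threshold: float) -> float:
    """Empirical CDF value: share of gaps strictly below the threshold."""
    if not gaps:
        raise ValueError("no gaps to summarise")
    return sum(g < threshold for g in gaps) / len(gaps)


def predicted_density(score: ScoreFn, num_nodes: int, cutoff: float = 0.5) -> float:
    """Share of ordered pairs (u != v) scored at or above the cutoff (Figure 2).

    The cutoff of 0.5 is an assumption made for this sketch.
    """
    pairs = [(u, v) for u in range(num_nodes) for v in range(num_nodes) if u != v]
    predicted = sum(score(u, v) >= cutoff for u, v in pairs)
    return predicted / len(pairs)


# Made-up probabilities for a 3-node toy graph, for illustration only.
toy_probs = {(0, 1): 0.90, (1, 0): 0.89, (1, 2): 0.80, (2, 1): 0.30}


def toy_score(u: int, v: int) -> float:
    """Look up the toy probability; unseen pairs score 0.0."""
    return toy_probs.get((u, v), 0.0)


gaps = direction_gaps(toy_score, [(0, 1), (1, 2)])
print(fraction_below(gaps, 0.02))       # 0.5: one of two edges is near-symmetric
print(predicted_density(toy_score, 3))  # 0.5: three of six ordered pairs predicted
```

A model that distinguishes edge direction well would push most gaps far above the 0.02 threshold, so `fraction_below` would be close to zero. Comparing `predicted_density` with the density of the training data gives the kind of contrast that Figure 2 reports.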