Exploring the Performance of Continuous-Time Dynamic Link Prediction Algorithms

Raphaël Romero, Maarten Buyl, Tijl De Bie, Jefrey Lijffijt

TL;DR

Dynamic Link Prediction on Continuous-Time Dynamic Graphs suffers from evaluation bias when using single metrics and naive time-based splits. The authors introduce Birth-Death diagrams and the Surprise Index to visualize dataset difficulty and the effect of train/test partitioning, along with a taxonomy of negative sampling strategies to stress-test evaluation. Through empirical studies on real-world CTDG benchmarks, they show that negative sampling choices can dramatically alter AUC and that performance trajectories over time expose distinct failure modes across methods. The proposed toolkit and guidelines enable more robust, interpretable, and fair evaluations of DLP methods across domains.

Abstract

Dynamic Link Prediction (DLP) addresses the prediction of future links in evolving networks. However, accurately portraying the performance of DLP algorithms poses challenges that might impede progress in the field. Importantly, common evaluation pipelines usually calculate ranking or binary classification metrics, where the scores of observed interactions (positives) are compared with those of randomly generated ones (negatives). However, a single metric is not sufficient to fully capture the differences between DLP algorithms, and is prone to overly optimistic performance evaluation. Instead, an in-depth evaluation should reflect performance variations across different nodes, edges, and time segments. In this work, we contribute tools to perform such a comprehensive evaluation. (1) We propose Birth-Death diagrams, a simple but powerful visualization technique that illustrates the effect of time-based train-test splitting on the difficulty of DLP on a given dataset. (2) We describe an exhaustive taxonomy of negative sampling methods that can be used at evaluation time. (3) We carry out an empirical study of the effect of the different negative sampling strategies. Our comparison between heuristics and state-of-the-art memory-based methods on various real-world datasets confirms a strong effect of using different negative sampling strategies on the test Area Under the Curve (AUC). Moreover, we conduct a visual exploration of the prediction, with additional insights on which different types of errors are prominent over time.
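The evaluation pipeline described in the abstract, in which scores of positives are compared with scores of sampled negatives, can be illustrated with a minimal sketch. The function name `pairwise_auc` and all score values below are hypothetical toy numbers chosen for illustration, not results from the paper. The sketch only shows why the same positive scores can yield very different AUC values depending on which negatives they are compared against.

```python
def pairwise_auc(pos_scores, neg_scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as half a win."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical model scores for observed (positive) test interactions.
pos = [0.9, 0.8, 0.6]

# Hypothetical scores for two sets of negatives: one easy to separate
# from the positives, one harder.
easy_negatives = [0.1, 0.2, 0.3]
hard_negatives = [0.7, 0.85, 0.5]

auc_easy = pairwise_auc(pos, easy_negatives)
auc_hard = pairwise_auc(pos, hard_negatives)
print(f"AUC vs easy negatives: {auc_easy:.3f}")
print(f"AUC vs hard negatives: {auc_hard:.3f}")
```

With these toy numbers the positives are perfectly separated from the easy negatives but only partly separated from the hard ones, so a single reported AUC depends strongly on the sampling strategy behind it.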

Paper Structure (21 sections, 6 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: A Birth-Death diagram on a recording of face-to-face interactions between high school students over 9 days [HighSchool_Fournet_Barrat_2014]. The y and x coordinates of each node/edge represent its first (Birth) and last (Death) interaction time, respectively. Given a cutoff time $t_{split}$, the history of interactions is divided into a train and a test set, and the nodes and edges are partitioned into three categories: Historical, Overlap, and Inductive. The Surprise Index is the ratio $\frac{\text{Inductive}}{\text{Inductive}+\text{Overlap}}$.
  • Figure 2: Birth-Death diagrams for Nodes and Edges in datasets from the Dynamic Graph Benchmark [poursafaeiBetterEvaluationDynamic2023]. The datasets are split into train and test sets containing 85% and 15% of the events, respectively.
  • Figure 3: Varying the test-split ratio linearly from 0.1 to 0.5 changes the Node and Edge Surprise Index differently depending on the dataset. The typical test-split ratio of 0.15 is marked with a "*" on each line.
  • Figure 4: Test AUC results obtained by comparing the scores of the positive events with the scores of negative events sampled using specific strategies. For DyRep and TGN, we retrained the models with 5 different seeds and report the mean and the standard deviation of the resulting AUCs.
  • Figure 5: MAR of each method over time. The positive event "Pos" is ranked against negative events resulting from different Negative Edge Sampling strategies: Historical Edge, Overlap Edge, and Inductive Edge, as defined in the section on the negative sampling taxonomy.
  • ...and 2 more figures
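
The partition described in the Figure 1 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy event list, the treatment of edges as undirected, and the boundary convention (death before $t_{split}$ is Historical, birth at or after $t_{split}$ is Inductive, everything else is Overlap) are all assumptions made for the example.

```python
from collections import Counter


def edge_birth_death(events):
    """Map each edge to (birth, death): its first and last interaction time.
    Edges are treated as undirected here, which is an assumption."""
    spans = {}
    for u, v, t in events:
        key = (min(u, v), max(u, v))
        birth, death = spans.get(key, (t, t))
        spans[key] = (min(birth, t), max(death, t))
    return spans


def categorize(birth, death, t_split):
    """Assign a category relative to the cutoff time (assumed convention)."""
    if death < t_split:
        return "historical"  # seen only in the train period
    if birth >= t_split:
        return "inductive"   # seen only in the test period
    return "overlap"         # seen in both periods


def surprise_index(spans, t_split):
    """Inductive / (Inductive + Overlap), as in the Figure 1 caption."""
    counts = Counter(categorize(b, d, t_split) for b, d in spans.values())
    in_test = counts["inductive"] + counts["overlap"]
    return counts["inductive"] / in_test if in_test else float("nan")


# Hypothetical toy interactions: (node, node, timestamp).
events = [
    ("a", "b", 1), ("a", "b", 6),  # spans the cutoff
    ("b", "c", 2), ("a", "c", 3),  # only before the cutoff
    ("c", "d", 7), ("a", "d", 8),  # only after the cutoff
]
spans = edge_birth_death(events)
print(surprise_index(spans, t_split=5))
```

In this toy example two of the three edges present in the test period are new, so the Edge Surprise Index is 2/3. The same functions could be applied to node-level birth and death times to obtain a Node Surprise Index.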

Theorems & Definitions (13)

  • Definition: Birth Time
  • Definition: Death Time
  • Definition: Historical
  • Definition: Inductive
  • Definition: Overlap
  • Remark 1
  • Definition
  • Definition
  • Definition
  • Definition: Negative Node Sampling
  • ...and 3 more