Mind the truncation gap: challenges of learning on dynamic graphs with recurrent architectures

João Bravo; Jacopo Bono; Pedro Saleiro; Hugo Ferreira; Pedro Bizarro

TL;DR

The paper studies learning on continuous-time dynamic graphs with graph recurrent neural networks (GRNNs) and identifies a truncation gap in backpropagation through time caused by batch-based training. It introduces a synthetic edge-regression task in which each node keeps a memory buffer of length $M$ and the target $y_k$ of an edge depends on the last elements of the buffers of its two endpoints, so that long-horizon dependencies can be quantified. Experiments on both the synthetic data and real-world dynamic-graph benchmarks (Reddit, Wikipedia, MOOC) show a consistent performance gap between full BPTT (F-BPTT) and truncated BPTT (T-BPTT), in favor of F-BPTT. The authors discuss future directions beyond backpropagation, including unbiased online learning approximations, and argue for more research to unlock GRNNs' capacity for long-range temporal reasoning.

Abstract

Systems characterized by evolving interactions, prevalent in social, financial, and biological domains, are effectively modeled as continuous-time dynamic graphs (CTDGs). To manage the scale and complexity of these graph datasets, machine learning (ML) approaches have become essential. However, CTDGs pose challenges for ML because traditional static graph methods do not naturally account for event timings. Newer approaches, such as graph recurrent neural networks (GRNNs), are inherently time-aware and offer advantages over static methods for CTDGs. However, GRNNs face another issue: the short truncation of backpropagation-through-time (BPTT), whose impact has not been properly examined until now. In this work, we demonstrate that this truncation can limit the learning of dependencies beyond a single hop, resulting in reduced performance. Through experiments on a novel synthetic task and real-world datasets, we reveal a performance gap between full backpropagation-through-time (F-BPTT) and the truncated backpropagation-through-time (T-BPTT) commonly used to train GRNN models. We term this gap the "truncation gap" and argue that understanding and addressing it is essential as the importance of CTDGs grows, discussing potential future directions for research in this area.
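To make the difference between F-BPTT and T-BPTT concrete, the following is a minimal sketch on a toy scalar recurrence, not on the GRNN models studied in the paper. The recurrence $h_t = w\,h_{t-1} + x_t$, the function name `bptt_grad`, and the `window` parameter are all assumptions introduced here for illustration. The gradient of the final state with respect to $w$ is accumulated either through every step (full) or only through the last `window` steps (truncated), with earlier states treated as constants.

```python
def bptt_grad(w, xs, window=None):
    """Gradient of the final state h_T w.r.t. w for the toy recurrence
    h_t = w * h_{t-1} + x_t with h_0 = 0 (hypothetical example).

    window=None backpropagates through all steps (full BPTT);
    window=K backpropagates only through the last K steps (truncated BPTT).
    """
    T = len(xs)
    start = 0 if window is None else max(0, T - window)
    h = 0.0  # hidden state h_{t-1}
    g = 0.0  # running derivative d h_{t-1} / d w
    for t, x in enumerate(xs):
        if t >= start:
            # Chain rule: d h_t / d w = h_{t-1} + w * d h_{t-1} / d w.
            g = h + w * g
        # Before `start`, g stays 0: the state is treated as a constant.
        h = w * h + x
    return g


xs = [1.0, 1.0, 1.0]
full = bptt_grad(0.5, xs)                 # 2.0, matches d/dw (w^2 + w + 1) = 2w + 1
truncated = bptt_grad(0.5, xs, window=1)  # 1.5, contribution of earlier steps is lost
print(full, truncated)
```

In this toy case the truncated gradient misses the part of the derivative that flows through earlier states, which is the kind of bias the abstract refers to when it says truncation can limit the learning of longer dependencies.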

Paper Structure (14 sections, 7 equations, 6 figures, 2 tables)

Introduction
Background
Graph Recurrent Neural Networks (GRNNs)
Training GRNNs: Batch Processing Strategies
Deep CoEvolve and DyRep.
JODIE.
TGN.
Synthetic Task
Task Specification
Results
Dynamic Graph Benchmark Results
Beyond Backpropagation
Conclusions
A General Framework for GRNNs

Figures (6)

Figure 1: Truncation of temporal history becomes severe in dynamic graphs. (left) Sequence based data can be grouped by sequence when defining batches. In this specific example of sequences with two events, with a batch capacity of 4 entity updates, we can include two sequences per batch. Temporal dependencies between the events (horizontal dotted lines) are not broken by batching. (right) Due to the interactions between states, we cannot consider isolating a subset of entities in a batch on dynamic graphs as we need the counterparty entity's state to update an entity's state when an event occurs. Batches are defined by time instead of by entity but this leads to more extreme gradient truncation along the time axis (time dependencies are broken by batching). In the example, with the same capacity of 4 entity updates, each batch now includes only a single event per entity.
Figure 2: Three different batching strategies illustrated. Four nodes with respective states $h_1 \dots h_4$ interact as in Figure 1 until time $t_3$. Each blue box denotes an interaction, where the hidden states of the two interacting nodes are updated. On the left, the approach of Dai et al. (2017), where a different computation graph is built for each mini-batch. Due to the sequential processing within a batch, computational efficiency is lost. In the middle, the t-Batching strategy of Kumar et al. (2019), which uses variable-sized batches to guarantee no sequential dependencies within a batch, allowing parallel processing. On the right, the approach of Rossi et al. (2020), which uses fixed-size batches and parallel processing at the cost of correctness, leading to inconsistent histories where the latent states used for the third event ignore the updates of the previous two events.
Figure 3: Visual depiction of synthetic task dynamics. A fixed-size FIFO buffer with length $M$ ($M = 4$ in the picture) is used to store the internal state of each node. When a new edge between two nodes occurs (a), the input is averaged with the last elements of each counterparty node's buffer in order to determine the new number to be stored in each buffer (b). The sum of these last elements also determines the output for the edge (c).
Figure 4: Mean squared error (MSE) obtained for different sized GRU models trained with both F-BPTT and T-BPTT. The depicted error bars correspond to the min and max MSE values over 5 different random seeds used for parameter initialization and dynamic graph sampling. The solid lines correspond to the mean MSE over the different seeds.
Figure 5: The dynamical system over the hidden states can in general contain two components: a function $f$ encoding the evolution between interactions (blue), and a function $g$ encoding the updates due to interaction events (red). In this example, interactions happen at times $t_1$, $t_2$ and $t_3$ between nodes 2-3, 1-3 and 2-3, respectively.
...and 1 more figure
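The synthetic task described in the Figure 3 caption can be sketched as follows. This is one possible reading of the caption, not the authors' implementation: the names `make_buffers` and `step`, the zero initialization of the buffers, and the choice that each node stores the average of the input with its counterparty's last (oldest) buffer element are all assumptions made here for illustration.

```python
from collections import deque


def make_buffers(num_nodes, M, init=0.0):
    """One FIFO buffer of length M per node (zero initialization is an assumption)."""
    return [deque([init] * M, maxlen=M) for _ in range(num_nodes)]


def step(buffers, u, v, x):
    """Process one edge (u, v) carrying input x and return the edge target y."""
    last_u = buffers[u][-1]  # oldest element of u's buffer
    last_v = buffers[v][-1]  # oldest element of v's buffer
    y = last_u + last_v      # (c) the target is the sum of the last elements
    # (b) each node stores the average of the input and its counterparty's
    # last element; appendleft on a full bounded deque drops the oldest entry.
    buffers[u].appendleft((x + last_v) / 2.0)
    buffers[v].appendleft((x + last_u) / 2.0)
    return y


buffers = make_buffers(num_nodes=2, M=2)
targets = [step(buffers, 0, 1, x) for x in [2.0, 4.0, 0.0, 0.0]]
print(targets)  # [0.0, 0.0, 2.0, 4.0]
```

With $M = 2$, the inputs 2.0 and 4.0 only show up in the targets two events later, which illustrates why a model must carry information across $M$ updates of a node to predict $y_k$.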
