PRES: Toward Scalable Memory-Based Dynamic Graph Neural Networks

Junwei Su; Difan Zou; Chuan Wu

TL;DR

Training memory-based dynamic graph neural networks (MDGNNs) is hindered by temporal discontinuity in batch processing, which disrupts chronological memory updates and limits data parallelism. The authors introduce PRES, an iterative prediction-correction framework with a memory-coherence smoothing objective, to enable significantly larger temporal batches without sacrificing performance. They provide theoretical insights on the impact of temporal batch size on variance and convergence, and demonstrate that PRES achieves up to a 4x increase in temporal batch size and about 3.4x speed-up on benchmarks. Practically, PRES extends the scalability of MDGNNs to industrial-scale dynamic graphs by improving training efficiency while preserving accuracy.

Abstract

Memory-based Dynamic Graph Neural Networks (MDGNNs) are a family of dynamic graph neural networks that leverage a memory module to extract, distill, and memorize long-term temporal dependencies, leading to superior performance compared to memory-less counterparts. However, training MDGNNs faces the challenge of handling entangled temporal and structural dependencies, requiring sequential and chronological processing of data sequences to capture accurate temporal patterns. During batch training, the data points within the same temporal batch are processed in parallel, so the temporal dependencies among them are neglected. This issue is referred to as temporal discontinuity; it restricts the effective temporal batch size, limiting data parallelism and reducing MDGNNs' flexibility in industrial applications. This paper studies the efficient training of MDGNNs at scale, focusing on the temporal discontinuity that arises when training MDGNNs with large temporal batch sizes. We first conduct a theoretical study of the impact of temporal batch size on the convergence of MDGNN training. Based on the analysis, we propose PRES, an iterative prediction-correction scheme combined with a memory coherence learning objective to mitigate the effect of temporal discontinuity, enabling MDGNNs to be trained with significantly larger temporal batches without sacrificing generalization performance. Experimental results demonstrate that our approach enables up to a 4x larger temporal batch (3.4x speed-up) during MDGNN training.
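
To make the link between temporal batch size and temporal discontinuity concrete, the following minimal sketch counts how many pairs of events in the same batch share a vertex, which is the situation the abstract describes as neglected temporal dependencies. The function name `count_pending_pairs`, the `(source, destination, timestamp)` event layout, and the shared-vertex criterion are illustrative assumptions, not the paper's formal definitions.

```python
from typing import List, Tuple

# Assumed event layout for illustration: (source, destination, timestamp).
Event = Tuple[int, int, float]


def count_pending_pairs(events: List[Event], batch_size: int) -> int:
    """Count pairs of events in the same temporal batch that share a vertex.

    Each such pair is a dependency that is ignored when the whole batch
    is processed in parallel.
    """
    ordered = sorted(events, key=lambda e: e[2])  # chronological order
    total = 0
    for start in range(0, len(ordered), batch_size):
        batch = ordered[start:start + batch_size]
        for k, (u, v, _) in enumerate(batch):
            # Compare against every earlier event in the same batch.
            for (u2, v2, _) in batch[:k]:
                if {u, v} & {u2, v2}:
                    total += 1
    return total


events = [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 3.0), (0, 3, 4.0)]
for b in (1, 2, 4):
    print(b, count_pending_pairs(events, b))
```

On this toy event list the count grows as the batch size grows, which is the qualitative effect that limits the usable temporal batch size.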

Paper Structure (47 sections, 4 theorems, 51 equations, 19 figures, 3 tables, 2 algorithms)

INTRODUCTION
RELATED WORK
Dynamic Graph Representation Learning.
Mini-Batch in Stochastic Gradient Descent (SGD).
PRELIMINARY AND BACKGROUND
Event-based Representation of Dynamic Graphs.
Memory-based Dynamic Graph Neural Network (MDGNN).
Training MDGNNs.
Temporal Discontinuity and Pending Events.
THEORETICAL ANALYSIS OF MDGNN TRAINING
PREdict-to-Smooth (PRES) Method
Iterative Prediction-Correction Scheme
Memory Coherence Smoothing
Theoretical Discussion of PRES
Variance.
...and 32 more sections

Key Result

Theorem 1

Let $\mathcal{E}$ be a given event set and $b$ be the temporal batch size. For a given MDGNN parameterized by $\theta$ with the training procedure of Eq. eq:mem_dgnn_negsamp, we have $\mathbb{E}[\|\nabla \hat{\mathcal{L}}(\theta) - \nabla \mathcal{L}(\theta)\|^2] \geq \frac{|\mathcal{E}|}{b} \sigma_{\m

Figures (19)

Figure 1: Illustration of the MDGNN process. Arrows of the same colour represent simultaneous operations. (1) Temporal events are sequentially processed and transformed into messages. (2) Arrived messages update the previous memory. (3) The updated memory is used to compute embeddings. (4) Time-dependent embeddings can be utilized for downstream tasks within the system.
Figure 2: The first panel depicts the training flow of MDGNN. The incoming batch serves as training samples for updating the model and memory for the subsequent batch. The second panel visualizes the temporal discontinuity that arises from pending events within the same temporal batch; $t^-$ indicates the moments before $t$. The top section showcases two pending events sharing a common vertex. The middle section demonstrates the transition of memory states when events are sequentially processed according to temporal order. The bottom section illustrates the transition when events are processed in parallel (large batch). The grey colour indicates unobserved or altered memory states and the dotted line indicates a missing transition, resulting in temporal discontinuity.
Figure 3: Performance of baselines under different batch sizes. The x-axis represents the batch size, while the y-axis represents the average precision (AP). The results are averaged over five trials.
Figure 4: Performance of baseline methods with and without PRES under different batch sizes on the WIKI dataset. The x-axis represents the batch size (multiplied by 100), while the y-axis represents the average precision (AP). The results are averaged over five trials with $\beta = 0.1$ for PRES.
Figure 5: Statistical efficiency of baseline methods with and without PRES. The x-axis is the training iteration and the y-axis is the average precision. $\beta = 0.1$ is used in PRES.
...and 14 more figures
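
The contrast described in the Figure 2 caption, sequential versus parallel processing of two events that share a vertex, can be reproduced with a toy memory update. This is only a sketch: the averaging rule in `update`, the helper names `sequential` and `batched`, and the "last message wins" handling of a vertex that appears twice in a batch are assumptions chosen for illustration, not the update functions of any particular MDGNN.

```python
def update(mem_self: float, mem_other: float) -> float:
    # Toy memory update (assumption): average own state with the neighbour's.
    return 0.5 * mem_self + 0.5 * mem_other


def sequential(memory: dict, events: list) -> dict:
    """Process events one by one; each event sees the latest memory."""
    mem = dict(memory)
    for u, v in events:
        new_u = update(mem[u], mem[v])
        new_v = update(mem[v], mem[u])
        mem[u], mem[v] = new_u, new_v
    return mem


def batched(memory: dict, events: list) -> dict:
    """Process a batch in parallel; every event reads the batch-start memory.

    If a vertex appears in several events, the last message overwrites
    the earlier ones, so intermediate transitions are lost.
    """
    pending = {}
    for u, v in events:
        pending[u] = update(memory[u], memory[v])
        pending[v] = update(memory[v], memory[u])
    mem = dict(memory)
    mem.update(pending)
    return mem


memory = {0: 0.0, 1: 1.0, 2: 2.0}
events = [(0, 1), (1, 2)]  # two events sharing vertex 1, in temporal order

print(sequential(memory, events))
print(batched(memory, events))
```

Vertex 1 takes part in both events, so the batched run skips its intermediate state, and the final memories of vertices 1 and 2 differ from the sequential result.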

Theorems & Definitions (10)

Definition 1: Pending Event
Definition 2: Pending Set
Theorem 1
Definition 3: Memory Coherence
Theorem 2
Proposition 1: Informal
proof
proof
Proposition 2
proof
