StreamTGN: A GPU-Efficient Serving System for Streaming Temporal Graph Neural Networks

Lingling Zhang; Pengpeng Qiao; Zhiwei Zhang; Ye Yuan; Guoren Wang

Abstract

Temporal Graph Neural Networks (TGNs) achieve state-of-the-art performance on dynamic graph tasks, yet existing systems focus exclusively on accelerating training -- at inference time, every new edge triggers $O(|V|)$ embedding updates even though only a small fraction of nodes are affected. We present \textbf{StreamTGN}, the first streaming TGN inference system exploiting the inherent locality of temporal graph updates: in an $L$-layer TGN, a new edge affects only nodes within $L$ hops of the endpoints, typically less than 0.2\% on million-node graphs. StreamTGN maintains persistent GPU-resident node memory and uses dirty-flag propagation to identify the affected set $\mathcal{A}$, reducing per-batch complexity from $O(|V|)$ to $O(|\mathcal{A}|)$ with zero accuracy loss. Drift-aware adaptive rebuild scheduling and batched streaming with relaxed ordering further maximize throughput. Experiments on eight temporal graphs (2K--2.6M nodes) show 4.5$\times$--739$\times$ speedup for TGN and up to 4,207$\times$ for TGAT, with identical accuracy. StreamTGN is orthogonal to training optimizations: combining SWIFT with StreamTGN yields 24$\times$ end-to-end speedup across three architectures (TGN, TGAT, DySAT).
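The locality argument above can be illustrated with a minimal sketch of dirty-flag propagation. This is not StreamTGN's GPU implementation: it is a CPU-side illustration that assumes an undirected adjacency map and ignores timestamps and neighbor sampling. The names `build_adjacency` and `affected_set` are hypothetical. It marks the endpoints of each new edge as dirty and expands the dirty set by at most `num_layers` hops, which yields the set $\mathcal{A}$ of nodes whose embeddings would need recomputation.

```python
from collections import defaultdict, deque


def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v) pairs (illustrative assumption)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def affected_set(adj, new_edges, num_layers):
    """Return all nodes within `num_layers` hops of any endpoint of `new_edges`.

    Multi-source breadth-first search: every endpoint starts at depth 0,
    and a node is flagged dirty the first time it is reached.
    """
    dirty = set()
    frontier = deque()
    for u, v in new_edges:
        for node in (u, v):
            if node not in dirty:
                dirty.add(node)
                frontier.append((node, 0))
    while frontier:
        node, depth = frontier.popleft()
        if depth == num_layers:
            continue  # do not expand beyond the L-hop boundary
        for nbr in adj.get(node, ()):
            if nbr not in dirty:
                dirty.add(nbr)
                frontier.append((nbr, depth + 1))
    return dirty


# Two paths, 0-1-2-3 and 4-5-6-7, joined by a new streaming edge (3, 4).
existing = [(0, 1), (1, 2), (2, 3), (4, 5), (5, 6), (6, 7)]
new_edges = [(3, 4)]
adj = build_adjacency(existing + new_edges)  # insert the new edge first

one_hop = affected_set(adj, new_edges, num_layers=1)
two_hop = affected_set(adj, new_edges, num_layers=2)
```

In this toy graph, a 2-layer model leaves nodes 0 and 7 untouched, so only six of the eight embeddings would be recomputed. On large sparse graphs the same reasoning gives an affected set far smaller than $|V|$.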

Paper Structure (70 sections, 13 theorems, 54 equations, 8 figures, 12 tables, 1 algorithm)

Introduction
Preliminaries
Temporal Graph Definition
TGNN Architecture
Stage 1: Time Encoding
Stage 2: Message Computation
Stage 3: Message Aggregation
Stage 4: Memory Update
Stage 5: Embedding Generation
Training and Inference Flow
Performance Analysis
From Logical Components to Profiling Stages
Stage-wise Analysis of Existing Methods
① Neighbor Sampling.
② Feature Retrieval.
...and 55 more sections

Key Result

theorem 1: Full Computation Complexity

For a TGN with $K$ attention layers and sampling fanout $L$, the cost of computing embeddings for all $n$ nodes over $m$ temporal edges is the sum of two terms: the first accounts for $K$-layer temporal attention (each node attends over $L$ neighbors per layer with $O(d^2)$ per attention head), and the second accounts for GRU-based memory updates.

Figures (8)

Figure 1: Overview of TGN training and inference. Training runs offline and infrequently; inference runs continuously at scale. Even a small improvement in inference latency yields enormous savings: $(10\text{\,ms} - 5\text{\,ms}) \times 10^8 \text{ queries/day} = 5 \times 10^5 \text{ seconds/day}$.
Figure 2: Overview of TGNN architecture including five modules with interleaved neural update and aggregation operations.
Figure 3: The distributions of processing time across the five profiling stages for TGN and TGAT on four datasets.
Figure 4: Overview of StreamTGN. The architecture comprises a GPU-resident hybrid data structure (left) and five incremental computation stages (right) that operate directly on the persistent state.
Figure 5: Overview of the GPU-resident hybrid data structure. Three persistent components---Temporal Adjacency List, Embedding Cache, and Node Memory---reside on the GPU across batches to enable incremental computation. The transient Edge Queue (dashed border) buffers streaming input and flushes at batch boundaries.
...and 3 more figures

Theorems & Definitions (19)

definition 1: Temporal Graph
definition 2: Node Set and Features
definition 3: Temporal Neighborhood
definition 4: TGNN Learning Problem
theorem 1: Full Computation Complexity
theorem 2: Incremental Computation Complexity
theorem 3: End-to-End Speedup
theorem 4: Optimality Condition
corollary 1
theorem 5: Lower Bound
...and 9 more
