StreamTGN: A GPU-Efficient Serving System for Streaming Temporal Graph Neural Networks

Lingling Zhang, Pengpeng Qiao, Zhiwei Zhang, Ye Yuan, Guoren Wang

Abstract

Temporal Graph Neural Networks (TGNs) achieve state-of-the-art performance on dynamic graph tasks, yet existing systems focus exclusively on accelerating training -- at inference time, every new edge triggers $O(|V|)$ embedding updates even though only a small fraction of nodes are affected. We present \textbf{StreamTGN}, the first streaming TGN inference system exploiting the inherent locality of temporal graph updates: in an $L$-layer TGN, a new edge affects only nodes within $L$ hops of the endpoints, typically less than 0.2\% on million-node graphs. StreamTGN maintains persistent GPU-resident node memory and uses dirty-flag propagation to identify the affected set $\mathcal{A}$, reducing per-batch complexity from $O(|V|)$ to $O(|\mathcal{A}|)$ with zero accuracy loss. Drift-aware adaptive rebuild scheduling and batched streaming with relaxed ordering further maximize throughput. Experiments on eight temporal graphs (2K--2.6M nodes) show 4.5$\times$--739$\times$ speedup for TGN and up to 4,207$\times$ for TGAT, with identical accuracy. StreamTGN is orthogonal to training optimizations: combining SWIFT with StreamTGN yields 24$\times$ end-to-end speedup across three architectures (TGN, TGAT, DySAT).
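
The locality claim above can be illustrated with a minimal sketch of dirty-flag propagation. This is not the StreamTGN implementation: the names `build_chain` and `affected_set` are hypothetical, the graph is treated as undirected, and timestamps and neighbor sampling are ignored. It only shows how the affected set $\mathcal{A}$ can be collected by marking the endpoints of new edges as dirty and expanding the marks for $L$ hops.

```python
from collections import defaultdict


def build_chain(n):
    """Hypothetical helper: undirected path graph 0-1-...-(n-1) as adjacency sets."""
    adj = defaultdict(set)
    for i in range(n - 1):
        adj[i].add(i + 1)
        adj[i + 1].add(i)
    return adj


def affected_set(adj, new_edges, num_layers):
    """Return the nodes within `num_layers` hops of any endpoint in `new_edges`.

    Assumption: a node's embedding can change only if a new-edge endpoint
    lies in its `num_layers`-hop neighborhood (undirected, time ignored).
    """
    # Insert the new edges first so propagation sees the updated graph.
    for u, v in new_edges:
        adj[u].add(v)
        adj[v].add(u)

    # Hop 0: the endpoints themselves are dirty.
    frontier = {node for edge in new_edges for node in edge}
    dirty = set(frontier)

    # Hops 1..num_layers: spread the dirty flag to unmarked neighbors.
    for _ in range(num_layers):
        next_frontier = set()
        for node in frontier:
            for neighbor in adj[node]:
                if neighbor not in dirty:
                    next_frontier.add(neighbor)
        if not next_frontier:
            break
        dirty |= next_frontier
        frontier = next_frontier
    return dirty


if __name__ == "__main__":
    graph = build_chain(10)
    touched = affected_set(graph, [(4, 5)], num_layers=2)
    print(sorted(touched))  # [2, 3, 4, 5, 6, 7]
```

Only the nodes in the returned set would need their embeddings recomputed for the batch; every other node can reuse its cached embedding, which is where the reduction from $O(|V|)$ to $O(|\mathcal{A}|)$ comes from.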

Paper Structure (70 sections, 13 theorems, 54 equations, 8 figures, 12 tables, 1 algorithm)

Key Result

theorem 1

For a TGN with $K$ attention layers and sampling fanout $L$, the cost of computing embeddings for all $n$ nodes over $m$ temporal edges is the sum of two terms: the first accounts for $K$-layer temporal attention (each node attends over $L$ neighbors per layer, at $O(d^2)$ per attention head), and the second accounts for GRU-based memory updates.

Figures (8)

  • Figure 1: Overview of TGN training and inference. Training runs offline and infrequently; inference runs continuously at scale. Even a small improvement in inference latency yields enormous savings: $(10\text{\,ms} - 5\text{\,ms}) \times 10^8 \text{ queries/day} = 5 \times 10^5 \text{ seconds/day}$.
  • Figure 2: Overview of TGNN architecture including five modules with interleaved neural update and aggregation operations.
  • Figure 3: The distributions of processing time across the five profiling stages for TGN and TGAT on four datasets.
  • Figure 4: Overview of StreamTGN. The architecture comprises a GPU-resident hybrid data structure (left) and five incremental computation stages (right) that operate directly on the persistent state.
  • Figure 5: Overview of the GPU-resident hybrid data structure. Three persistent components (Temporal Adjacency List, Embedding Cache, and Node Memory) reside on the GPU across batches to enable incremental computation. The transient Edge Queue (dashed border) buffers streaming input and flushes at batch boundaries.
  • ...and 3 more figures

Theorems & Definitions (19)

  • definition 1: Temporal Graph
  • definition 2: Node Set and Features
  • definition 3: Temporal Neighborhood
  • definition 4: TGNN Learning Problem
  • theorem 1: Full Computation Complexity
  • theorem 2: Incremental Computation Complexity
  • theorem 3: End-to-End Speedup
  • theorem 4: Optimality Condition
  • corollary 1
  • theorem 5: Lower Bound
  • ...and 9 more