Retrofitting Temporal Graph Neural Networks with Transformer

Qiang Huang, Xiao Yan, Xin Wang, Susie Xi Rao, Zhichao Han, Fangcheng Fu, Wentao Zhang, Jiawei Jiang

TL;DR

TF-TGN retrofits temporal graph neural networks with a Transformer decoder, recasting temporal message aggregation as autoregressive sequence modeling via suffix infilling and causal masking attention. It couples this modeling approach with a parallel CSR conversion and sampling pipeline and leverages high-performance Transformer kernels and distributed training to achieve large speedups. Empirical results on 9 real-world graphs demonstrate competitive accuracy with state-of-the-art TGNNs and significant end-to-end training time reductions, including dramatic improvements in CSR conversion and sampling. The work provides a practical pathway to scalable, transformer-backed TGNNs capable of handling graphs with billions of edges in dynamic link-prediction tasks.

Abstract

Temporal graph neural networks (TGNNs) outperform regular GNNs by incorporating time information into graph-based operations. However, TGNNs adopt specialized models (e.g., TGN, TGAT, and APAN) and require tailored training frameworks (e.g., TGL and ETC). In this paper, we propose TF-TGN, which uses a Transformer decoder as the backbone model for TGNNs so that it can reuse the Transformer codebase for efficient training. In particular, the Transformer has achieved tremendous success in language modeling, and the community has therefore developed high-performance kernels (e.g., flash-attention and memory-efficient attention) and efficient distributed training schemes (e.g., PyTorch FSDP, DeepSpeed, and Megatron-LM). We observe that TGNNs resemble language modeling, i.e., the message aggregation operation between chronologically occurring nodes and their temporal neighbors in TGNNs can be structured as sequence modeling. Besides this similarity, we also incorporate a series of algorithm designs, including suffix infilling, temporal graph attention with self-loop, and causal masking self-attention, to make TF-TGN work. During training, existing systems are slow in transforming the graph topology and conducting graph sampling. As such, we propose methods to parallelize the CSR format conversion and graph sampling. We also adapt the Transformer codebase to train TF-TGN efficiently with multiple GPUs. We experiment with 9 graphs and compare with 2 state-of-the-art TGNN training frameworks. The results show that TF-TGN can accelerate training by over 2.20× while providing comparable or even superior accuracy to existing SOTA TGNNs. TF-TGN is available at https://github.com/qianghuangwhu/TF-TGN.
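
To make the CSR conversion and temporal sampling steps mentioned in the abstract concrete, the following is a minimal, sequential sketch in plain Python. It is not the TF-TGN implementation (which parallelizes both steps); the function names `build_temporal_csr` and `sample_recent_neighbors`, the directed `(src, dst, timestamp)` edge format, and the "most recent k neighbors strictly before t" sampling policy are all assumptions chosen for illustration.

```python
from bisect import bisect_left


def build_temporal_csr(edges, num_nodes):
    """Convert (src, dst, timestamp) edges into CSR arrays (hypothetical helper).

    Returns (indptr, indices, times). The neighbors of node u are
    indices[indptr[u]:indptr[u + 1]], sorted by ascending timestamp.
    """
    # Count the out-degree of every node, then prefix-sum into row pointers.
    indptr = [0] * (num_nodes + 1)
    for src, _, _ in edges:
        indptr[src + 1] += 1
    for u in range(num_nodes):
        indptr[u + 1] += indptr[u]

    # Sorting by (src, timestamp) lays rows out contiguously and
    # keeps each row in chronological order.
    ordered = sorted(edges, key=lambda e: (e[0], e[2]))
    indices = [dst for _, dst, _ in ordered]
    times = [ts for _, _, ts in ordered]
    return indptr, indices, times


def sample_recent_neighbors(indptr, indices, times, u, t, k):
    """Return up to k most recent (neighbor, timestamp) pairs of u before time t."""
    start, end = indptr[u], indptr[u + 1]
    # Binary search inside row u for the first edge at or after time t.
    cut = bisect_left(times, t, start, end)
    lo = max(start, cut - k)
    return list(zip(indices[lo:cut], times[lo:cut]))


edges = [(0, 1, 1.0), (0, 2, 3.0), (1, 0, 2.0), (0, 3, 5.0), (2, 0, 4.0)]
indptr, indices, times = build_temporal_csr(edges, num_nodes=4)
recent = sample_recent_neighbors(indptr, indices, times, u=0, t=4.0, k=2)
```

Because each row is already sorted by time, sampling reduces to a binary search per node. Rows are independent of one another, which is what makes both steps natural candidates for parallelization.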

Paper Structure (23 sections, 17 equations, 10 figures, 4 tables, 1 algorithm)

Figures (10)

  • Figure 1: Comparison of normalized values for GPU memory, RAM, and runtime between memory-based and attention-based TGNNs.
  • Figure 2: (a) The neighbors of a node sampled by the temporal sampler; (b) Suffix infilling the sampled temporal neighbors using the node itself as a sequence $\mathcal{X}_u(t)$; (c) Temporal graph attention with self-loop; (d) Attention with causal masking of the Transformer decoder on the suffix infilling sequence.
  • Figure 3: (a) Batch of suffix infilling temporal sequences; (b) The causal masking for the batch of suffix infilling temporal sequences. Cells in gray are padding index zero and are masked in the self-attention mechanism.
  • Figure 4: Comparison of the normalized time for CSR conversion, temporal neighbor sampling, and model training when training different TGNNs integrated with the TGL framework on the GDELT dataset.
  • Figure 5: The batch training strategy of TGNNs.
  • ...and 5 more figures
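
The captions of Figures 2 and 3 describe suffix infilling, causal masking, and padding with index zero. The sketch below illustrates how these pieces could fit together in plain Python. It is an illustration under stated assumptions, not the authors' code: the helper names are hypothetical, node IDs are assumed to start at 1 so that 0 can serve as the padding index, and right-padding is an arbitrary choice.

```python
PAD = 0  # padding index zero, as stated in the Figure 3 caption


def suffix_infill(node, neighbors):
    """Build the sequence for a node: its chronologically sorted sampled
    neighbors followed by the node itself as the final token."""
    return list(neighbors) + [node]


def pad_batch(sequences, pad=PAD):
    """Right-pad all sequences to a common length (padding side is an assumption)."""
    max_len = max(len(s) for s in sequences)
    return [s + [pad] * (max_len - len(s)) for s in sequences]


def causal_padding_mask(seq, pad=PAD):
    """Return mask[i][j], True when position i may attend to position j.

    A position attends only to itself and earlier positions (causal),
    and padding positions neither attend nor are attended to.
    """
    n = len(seq)
    return [
        [j <= i and seq[i] != pad and seq[j] != pad for j in range(n)]
        for i in range(n)
    ]


# Node 7 has sampled neighbors [3, 5]; node 9 has a single sampled neighbor [4].
batch = pad_batch([suffix_infill(7, [3, 5]), suffix_infill(9, [4])])
masks = [causal_padding_mask(seq) for seq in batch]
```

In this layout the row of the mask belonging to the node token covers all of its sampled neighbors plus the node itself, which corresponds to the attention with self-loop described in the Figure 2 caption.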