Haste Makes Waste: A Simple Approach for Scaling Graph Neural Networks

Rui Xue, Tong Zhao, Neil Shah, Xiaorui Liu

TL;DR

This paper proposes REST, a simple training algorithm that reduces feature staleness, which significantly improves performance and convergence across varying batch sizes.

Abstract

Graph neural networks (GNNs) have demonstrated remarkable success in graph representation learning, and various sampling approaches have been proposed to scale GNNs to applications with large-scale graphs. A class of promising GNN training algorithms takes advantage of historical embeddings to reduce computation and memory cost while maintaining the model expressiveness of GNNs. However, these algorithms incur significant computation bias because the feature history is stale. In this paper, we provide a comprehensive analysis of their staleness and inferior performance on large-scale problems. Motivated by our discoveries, we propose a simple yet highly effective training algorithm (REST) that reduces feature staleness, which leads to significantly improved performance and convergence across varying batch sizes. The proposed algorithm integrates seamlessly with existing solutions and is easy to implement, and comprehensive experiments demonstrate its superior performance and efficiency on large-scale benchmarks. Specifically, our improvements to state-of-the-art historical embedding methods result in 2.7% and 3.6% performance improvements on the ogbn-papers100M and ogbn-products datasets, respectively, accompanied by notably accelerated convergence.

Paper Structure

This paper contains 31 sections, 3 theorems, 10 equations, 22 figures, 14 tables, 1 algorithm.

Key Result

Theorem 1

Consider an $L$-layer GNN $f_\theta^{(l)}(h)$ with Lipschitz constant $\alpha^{(l)}$ and an UPDATE$^{(l)}_\theta$ function with Lipschitz constant $\beta$, for $l = 1, \dots, L$. $\nabla L_{\theta}$ has Lipschitz constant $\varepsilon$. If $\forall v\in V$, $\|\bar{h}^{(l)} - h^{(l)}\|$ denotes the staleness between

Figures (22)

  • Figure 1: GAS and FM exhibit inferior performance and slower convergence, especially on larger datasets (i.e., ogbn-products) or with small batch sizes (i.e., "Small").
  • Figure 2: Embedding memory persistence (GAS).
  • Figure 3: Approximation error (GAS).
  • Figure 4: Training process for the proposed REST technique. (1) $F$ mini-batches $\mathbf{B}_{1}, \dots, \mathbf{B}_{F}$ (blue ellipse) are executed without computing gradients to update the memory table. (2) One additional mini-batch $\tilde{B}$ (yellow ellipse) is processed with gradient computation to update the model parameters.
  • Figure 5: Convergence on ogbn-arxiv.
  • ...and 17 more figures
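
The two-phase schedule described in the Figure 4 caption can be illustrated with a small toy simulation. This is only a sketch under strong assumptions, not the authors' implementation: the "GNN" is a hypothetical one-parameter model $h_v = w \cdot x_v$, the memory table is a plain Python list, and the function name `simulate_rest` and all of its parameters are invented for illustration. The toy gradient does not read from the memory table, so the sketch shows only how forward-only refresh passes shrink the gap between stored and exact embeddings, not any effect on accuracy.

```python
import random


def simulate_rest(refresh_batches, num_steps=30, num_nodes=100,
                  batch_size=10, lr=0.1, seed=0):
    """Toy REST-style schedule on a one-parameter model h_v = w * x_v.

    Returns the final parameter and the mean staleness after each step.
    All names and defaults here are illustrative assumptions.
    """
    # Separate generators so the gradient batches are identical for any F.
    refresh_rng = random.Random(seed)
    grad_rng = random.Random(seed + 1)

    nodes = list(range(num_nodes))
    features = [0.5 + (v % 11) / 10 for v in nodes]   # x_v in [0.5, 1.5]
    targets = [2.0 * x for x in features]             # optimum is w = 2
    w = 0.0
    memory = [w * x for x in features]                # historical embeddings

    staleness_log = []
    for _ in range(num_steps):
        # (1) F forward-only mini-batches: refresh the memory table;
        #     no gradient is computed and w is left unchanged.
        for _ in range(refresh_batches):
            for v in refresh_rng.sample(nodes, batch_size):
                memory[v] = w * features[v]

        # (2) One mini-batch with gradient computation to update w.
        batch = grad_rng.sample(nodes, batch_size)
        grad = 0.0
        for v in batch:
            h = w * features[v]
            memory[v] = h                             # in-batch rows are fresh
            grad += 2.0 * (h - targets[v]) * features[v]
        w -= lr * grad / batch_size

        # Staleness: gap between stored and exact embeddings under the new w.
        gaps = [abs(memory[v] - w * features[v]) for v in nodes]
        staleness_log.append(sum(gaps) / num_nodes)
    return w, staleness_log


if __name__ == "__main__":
    for f in (0, 3):
        w, log = simulate_rest(refresh_batches=f)
        print(f"F={f}: w={w:.4f}, mean staleness={sum(log) / len(log):.4f}")
```

Setting `refresh_batches=0` corresponds to a plain historical-embedding schedule in this toy, while a positive value adds the forward-only passes. Because the gradient batches are drawn from a separate random generator, both settings follow the same parameter trajectory, so any difference in the logged staleness comes from the refresh passes alone.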

Theorems & Definitions (4)

  • Theorem 1: Gradient Approximation Error
  • Theorem 2
  • Theorem 1: Approximation Error
  • proof