Cooperative Minibatching in Graph Neural Networks

Muhammed Fatih Balin, Dominique LaSalle, Ümit V. Çatalyürek

TL;DR

This work proposes a new approach to minibatch training called Cooperative Minibatching, which capitalizes on the fact that the size of the sampled subgraph is a concave function of the batch size, leading to significant reductions in the amount of work as batch sizes increase.

Abstract

Training large-scale Graph Neural Networks (GNNs) requires significant computational resources, and the process is highly data-intensive. One of the most effective ways to reduce resource requirements is minibatch training coupled with graph sampling. GNNs have the unique property that items in a minibatch have overlapping data. However, the commonly implemented Independent Minibatching approach assigns each Processing Element (PE, i.e., cores and/or GPUs) its own minibatch to process, leading to duplicated computations and input data access across PEs. This amplifies the Neighborhood Explosion Phenomenon (NEP), which is the main bottleneck limiting scaling. To reduce the effects of NEP in the multi-PE setting, we propose a new approach called Cooperative Minibatching. Our approach capitalizes on the fact that the size of the sampled subgraph is a concave function of the batch size, leading to significant reductions in the amount of work as batch sizes increase. Hence, it is favorable for processors equipped with a fast interconnect to work on a large minibatch together as a single larger processor, instead of working on separate smaller minibatches, even though the global batch size is identical. We also show how to take advantage of the same phenomenon in serial execution by generating dependent consecutive minibatches. Our experimental evaluations show up to 4x bandwidth savings for fetching vertex embeddings, by simply increasing this dependency without harming model convergence. Combining our proposed approaches, we achieve up to 64% speedup over Independent Minibatching on single-node multi-GPU systems, using the same resources.

Paper Structure

This paper contains 31 sections, 6 theorems, 17 equations, 8 figures, 7 tables, and 1 algorithm.

Key Result

Theorem 3.1

The work per epoch $\frac{E[|S^l|]}{|S^0|}$ required to train a GNN model using minibatch training is monotonically nonincreasing as the batch size $|S^0|$ increases.
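The trend stated in the theorem can be observed empirically on a toy graph. The following is a minimal sketch, not the paper's experimental setup: the synthetic random graph, the fanout of 5, the two sampling layers, and all function names are assumptions chosen for illustration. It estimates $\frac{E[|S^2|]}{|S^0|}$ for a plain neighbor sampler at several batch sizes.

```python
import random

def build_random_graph(num_vertices, degree, rng):
    """Synthetic adjacency lists: every vertex gets `degree` random neighbors."""
    return [rng.sample(range(num_vertices), degree) for _ in range(num_vertices)]

def sampled_subgraph_size(adj, seeds, fanout, num_layers, rng):
    """Return |S^L|, the number of vertices reached from the seed set S^0
    after `num_layers` rounds of neighbor sampling."""
    frontier = set(seeds)
    for _ in range(num_layers):
        next_frontier = set(frontier)  # S^{l+1} contains S^l
        for v in frontier:
            next_frontier.update(rng.sample(adj[v], min(fanout, len(adj[v]))))
        frontier = next_frontier
    return len(frontier)

def work_per_seed(adj, batch_size, fanout=5, num_layers=2, trials=20, seed=0):
    """Monte Carlo estimate of E[|S^L|] / |S^0| for a given batch size."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        seeds = rng.sample(range(len(adj)), batch_size)
        total += sampled_subgraph_size(adj, seeds, fanout, num_layers, rng)
    return total / (trials * batch_size)

# Hypothetical toy graph: 2000 vertices with 10 random neighbors each.
graph = build_random_graph(2000, 10, random.Random(42))
ratios = {b: work_per_seed(graph, b) for b in (8, 64, 512)}
for b, r in ratios.items():
    print(f"batch size {b:4d}: estimated E[|S^2|]/|S^0| = {r:.2f}")
```

Larger batches share more sampled neighbors, so the estimated per-seed work shrinks as the batch size grows. The ratio can never exceed the number of vertices divided by the batch size, since the sampled subgraph cannot outgrow the graph.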

Figures (8)

  • Figure 1: A smoothed dependent minibatching example for $\kappa=2$. The middle minibatch is interpolated between the two independent minibatches on the left and the right by interpolating the random numbers used during sampling.
  • Figure 2: Monotonicity of the work. The x-axis shows the batch size; the y-axis shows $\frac{E[|S^3|]}{|S^0|}$ (see \ref{th:work_monotonicity}) for node prediction (top row) and $E[|S^3|]$ (see \ref{th:overlap_monotonicity}) for edge prediction (bottom row), where $E[|S^3|]$ denotes the expected number of sampled vertices in the 3rd layer and $|S^0|$ denotes the batch size. RW stands for Random Walks, NS for Neighbor Sampling, and LABOR-0/* for the two different variants of the LABOR sampling algorithm described in \ref{subsecc:graph_sampling}.
  • Figure 3: The validation F1-score with NS-sampled neighborhoods for models trained with the LABOR-0 sampling algorithm, a batch size of $1024$, and varying $\kappa$ for dependent minibatches. $\kappa=\infty$ denotes infinite dependency, meaning the neighborhood sampled for a vertex stays static during training. See \ref{figc:same_loss_cache_miss} for cache miss rates. See \ref{figc:same_loss_cache_sampling} for the training loss and F1-score with the dependent sampler.
  • Figure 4: LRU-cache miss rates for the LABOR-0 sampling algorithm with a batch size of $1024$ per GPU and varying $\kappa$ for dependent minibatches. $\kappa=\infty$ denotes infinite dependency.
  • Figure 5: Monotonicity of the work. The x-axis shows the batch size; the y-axis shows $E[|S^3|]$ for node prediction (top row) and $\frac{E[|S^3|]}{|S^0|}$ for edge prediction (bottom row), where $E[|S^3|]$ denotes the expected number of vertices sampled in the 3rd layer and $|S^0|$ denotes the batch size. RW stands for Random Walks, NS for Neighbor Sampling, and LABOR-0/* for the two different variants of the LABOR sampling algorithm described in \ref{subsecc:graph_sampling}. Completes \ref{figc:num_input_nodes}.
  • ...and 3 more figures
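
The interpolation idea in the Figure 1 caption can be sketched as follows. This is an illustrative sketch only: it assumes a sampler that keeps a vertex when its per-vertex random number falls below a threshold, and it assumes one particular interpolation, mixing two independent standard normal variates with cosine and sine weights and mapping the result through the normal CDF. The function names, the keep probability, and the graph size are hypothetical, and the paper's exact scheme may differ.

```python
import math
import random
from statistics import NormalDist

def independent_normals(num_vertices, seed):
    """One standard normal variate per vertex, drawn independently."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(num_vertices)]

def interpolated_uniforms(n_left, n_right, step, kappa):
    """Per-vertex uniforms for minibatch `step` out of `kappa`.

    cos(t)*a + sin(t)*b of two independent standard normals is again a
    standard normal, so its CDF value is uniform on [0, 1] at every step.
    """
    theta = 0.5 * math.pi * step / kappa
    cdf = NormalDist().cdf
    return [cdf(math.cos(theta) * a + math.sin(theta) * b)
            for a, b in zip(n_left, n_right)]

def kept_vertices(uniforms, keep_prob):
    """Vertices whose random number falls below the keep probability."""
    return {v for v, r in enumerate(uniforms) if r <= keep_prob}

NUM_VERTICES, KEEP_PROB, KAPPA = 10_000, 0.1, 2
left = independent_normals(NUM_VERTICES, seed=1)
right = independent_normals(NUM_VERTICES, seed=2)

# Steps 0 and KAPPA reproduce the two independent draws; step 1 lies between.
batches = [kept_vertices(interpolated_uniforms(left, right, i, KAPPA), KEEP_PROB)
           for i in range(KAPPA + 1)]
first, middle, last = batches
print("shared by first and last  :", len(first & last))
print("shared by first and middle:", len(first & middle))
print("shared by middle and last :", len(middle & last))
```

Under these assumptions the middle minibatch shares far more kept vertices with each endpoint than the two independent endpoints share with each other. That overlap between consecutive minibatches is what a vertex-embedding cache can exploit.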

Theorems & Definitions (11)

  • Theorem 3.1
  • proof
  • Theorem 3.2
  • proof
  • Theorem 3.3
  • Theorem A.1
  • proof
  • Theorem A.2
  • proof
  • Theorem A.3
  • ...and 1 more