Distributed Training of Large Graph Neural Networks with Variable Communication Rates

Juan Cervino, Md Asadullah Turja, Hesham Mostafa, Nageen Himayat, Alejandro Ribeiro

TL;DR

The paper tackles the bottleneck of inter-machine communication in distributed GNN training on large graphs. It introduces VARCO, a variable compression scheme that progressively reduces communication by compressing boundary activations, with a convergence analysis showing it can reach the full-communication solution as compression errors decay. Empirically, VARCO matches or surpasses full communication in accuracy while using significantly fewer communicated bytes across random and METIS graph partitions on large datasets, and it outperforms fixed compression strategies in efficiency. This work provides a practical, partition-agnostic approach to scalable graph learning with strong theoretical and empirical guarantees.

Abstract

Training Graph Neural Networks (GNNs) on large graphs presents unique challenges due to the large memory and computing requirements. Distributed GNN training, where the graph is partitioned across multiple machines, is a common approach to training GNNs on large graphs. However, as the graph cannot generally be decomposed into small non-interacting components, data communication between the training machines quickly limits training speeds. Compressing the communicated node activations by a fixed amount improves training speed, but lowers the accuracy of the trained GNN. In this paper, we introduce a variable compression scheme for reducing the communication volume in distributed GNN training without compromising the accuracy of the learned model. Based on our theoretical analysis, we derive a variable compression method that converges to a solution equivalent to the full-communication case, for all graph partitioning schemes. Our empirical results show that our method attains performance comparable to that obtained with full communication, and that, for any communication budget, it outperforms both full communication and any fixed compression ratio.
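The variable-compression idea can be illustrated with a toy quantizer whose bit width grows over training, so that the compression error decays toward zero, which is the behavior the convergence analysis relies on. The following is a minimal sketch assuming a simple linear bit-width schedule; `compression_bits`, `quantize`, and the schedule itself are illustrative assumptions, not VARCO's actual implementation.

```python
import numpy as np

def compression_bits(epoch, total_epochs, min_bits=2, max_bits=32):
    """Hypothetical schedule: start with aggressive compression (few bits per
    entry) and raise precision over training so the compression error decays."""
    frac = epoch / max(total_epochs - 1, 1)
    return int(round(min_bits + frac * (max_bits - min_bits)))

def quantize(acts, bits):
    """Uniform quantization of boundary activations to `bits` bits per entry.
    At 32 bits the float32 values are kept as-is (no compression error)."""
    if bits >= 32:
        return acts.astype(np.float32)
    lo, hi = acts.min(), acts.max()
    scale = (hi - lo) / (2 ** bits - 1) if hi > lo else 1.0
    codes = np.round((acts - lo) / scale)            # integer codes
    return (codes * scale + lo).astype(np.float32)   # dequantized values

# Toy boundary activations: reconstruction error shrinks as training advances.
rng = np.random.default_rng(0)
acts = rng.standard_normal((4, 8)).astype(np.float32)
for epoch in (0, 5, 9):
    bits = compression_bits(epoch, total_epochs=10)
    err = np.abs(quantize(acts, bits) - acts).mean()
    print(f"epoch {epoch}: {bits} bits/entry, mean abs error {err:.5f}")
```

Running the loop shows the reconstruction error dropping as the bit width increases, which is the mechanism by which a variable compression schedule can approach the full-communication solution.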

Paper Structure

This paper contains 17 sections, 36 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Example of partitioning a graph with $9$ nodes into $3$ machines. Each machine only stores the features of the nodes in their corresponding partition.
  • Figure 2: To compute a gradient step, each machine must gather data from adjacent partitions. Each machine first computes the activations of its local nodes; these activations are compressed and communicated to adjacent machines, which then decompress the received activations (see the sketch after this list).
  • Figure 3: Accuracy per iteration for random partitioning with $16$ servers.
  • Figure 4: Accuracy as a function of the number of servers.
  • Figure 5: Accuracy per floating points communicated for random partitioning with $16$ servers.
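
The compute-compress-communicate-decompress step described in Figure 2 can be mimicked in a single process. The sketch below simulates two machines exchanging quantized boundary activations; the `Machine` class, the single linear layer, and the 4-bit uniform quantization are assumptions made for illustration, not the paper's actual pipeline.

```python
import numpy as np

class Machine:
    """Toy stand-in for one worker that stores a single graph partition."""
    def __init__(self, local_feats, boundary_idx):
        self.local_feats = local_feats    # features of the locally stored nodes
        self.boundary_idx = boundary_idx  # local nodes whose activations a neighbor needs

    def local_activations(self, weight):
        # Step 1: compute activations of the local nodes (one toy linear layer).
        return self.local_feats @ weight

    def compress_boundary(self, acts, bits):
        # Step 2: compress only the boundary activations before sending them.
        boundary = acts[self.boundary_idx]
        lo, hi = boundary.min(), boundary.max()
        scale = (hi - lo) / (2 ** bits - 1) if hi > lo else 1.0
        codes = np.round((boundary - lo) / scale).astype(np.uint8)
        return codes, lo, scale           # this is what travels over the wire

    @staticmethod
    def decompress(codes, lo, scale):
        # Step 4: the receiving machine rebuilds approximate boundary activations.
        return codes.astype(np.float32) * scale + lo

rng = np.random.default_rng(1)
weight = rng.standard_normal((8, 4)).astype(np.float32)
m0 = Machine(rng.standard_normal((5, 8)).astype(np.float32), boundary_idx=[0, 3])
m1 = Machine(rng.standard_normal((6, 8)).astype(np.float32), boundary_idx=[2])

# Step 3: exchange compressed boundary activations (shown in one direction only).
payload = m0.compress_boundary(m0.local_activations(weight), bits=4)
received_on_m1 = Machine.decompress(*payload)
print("machine 1 received approximate boundary activations:\n", received_on_m1)
```

In an actual distributed run these payloads would be sent with point-to-point or collective communication primitives; here the exchange is a plain function call to keep the sketch self-contained.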

Theorems & Definitions (7)

  • Proof
  • Proof
  • Proof of Lemma (lemma:func_diff)
  • Proof of Lemma (lemma:grad_diff)
  • Proof of Lemma (lemma:lipschitz_loss_wrt_params)
  • Proof of Lemma (lemma:submartingale)
  • Proof of Proposition (prop:fixed_compression)