Table of Contents
Fetching ...

Efficient Message Passing Architecture for GCN Training on HBM-based FPGAs with Orthogonal Topology On-Chip Networks

Qizhe Wu, Letian Zhao, Yuchen Gui, Huawen Liang Xiaotian Wang

TL;DR

A message-passing architecture that leverages NUMA-based memory access properties and employs a parallel multicast routing algorithm based on a 4-D hypercube network within the accelerator for efficient message passing in graphs is proposed.

Abstract

Graph Convolutional Networks (GCNs) are state-of-the-art deep learning models for representation learning on graphs. However, the efficient training of GCNs is hampered by constraints in memory capacity and bandwidth, compounded by the irregular data flow that results in communication bottlenecks. To address these challenges, we propose a message-passing architecture that leverages NUMA-based memory access properties and employs a parallel multicast routing algorithm based on a 4-D hypercube network within the accelerator for efficient message passing in graphs. Additionally, we have re-engineered the backpropagation algorithm specific to GCNs within our proposed accelerator. This redesign strategically mitigates the memory demands prevalent during the training phase and diminishes the computational overhead associated with the transposition of extensive matrices. Compared to the state-of-the-art HP-GNN architecture we achieved a performance improvement of $1.03\times \sim 1.81\times$.

Efficient Message Passing Architecture for GCN Training on HBM-based FPGAs with Orthogonal Topology On-Chip Networks

TL;DR

A message-passing architecture that leverages NUMA-based memory access properties and employs a parallel multicast routing algorithm based on a 4-D hypercube network within the accelerator for efficient message passing in graphs is proposed.

Abstract

Graph Convolutional Networks (GCNs) are state-of-the-art deep learning models for representation learning on graphs. However, the efficient training of GCNs is hampered by constraints in memory capacity and bandwidth, compounded by the irregular data flow that results in communication bottlenecks. To address these challenges, we propose a message-passing architecture that leverages NUMA-based memory access properties and employs a parallel multicast routing algorithm based on a 4-D hypercube network within the accelerator for efficient message passing in graphs. Additionally, we have re-engineered the backpropagation algorithm specific to GCNs within our proposed accelerator. This redesign strategically mitigates the memory demands prevalent during the training phase and diminishes the computational overhead associated with the transposition of extensive matrices. Compared to the state-of-the-art HP-GNN architecture we achieved a performance improvement of .

Paper Structure

This paper contains 18 sections, 10 equations, 12 figures, 3 tables, 1 algorithm.

Figures (12)

  • Figure 1: HBM bandwidth benchmarking. (a) AXI port reads for local channels. (b)(c)(d) Concurrent reads for non-local channels.
  • Figure 2: Overall architecture of GNN Training.
  • Figure 3: Core architecture.
  • Figure 4: 4-D hypercube on-chip network.
  • Figure 5: Switch model in each core.
  • ...and 7 more figures