AMPLE: Event-Driven Accelerator for Mixed-Precision Inference of Graph Neural Networks

Pedro Gimenes, Yiren Zhao, George Constantinides

TL;DR

The paper tackles the irregular memory access and workload imbalance that arise in graph neural network inference on large, sparse graphs. It introduces AMPLE, an FPGA accelerator that uses an event-driven host-programmable flow, a heterogeneous on-chip network with dynamic resource allocation, and a node-centric, mixed-precision approach to inference. A node-level quantization strategy (DegreeQuant) paired with a node-centric prefetcher enables scalable, memory-efficient processing without storing all of the graph's node embeddings on-chip. On graphs ranging from $2\text{K}$ to $700\text{K}$ nodes, AMPLE achieves mean speedups of $243\times$ over CPU and $7.2\times$ over GPU baselines, demonstrating its potential for practical GNN acceleration on large graphs.

Abstract

Graph Neural Networks (GNNs) have recently gained attention due to their performance on non-Euclidean data. The use of custom hardware architectures proves particularly beneficial for GNNs due to their irregular memory access patterns, resulting from the sparse structure of graphs. However, existing FPGA accelerators are limited by their double buffering mechanism, which doesn't account for the irregular node distribution in typical graph datasets. To address this, we introduce AMPLE (Accelerated Message Passing Logic Engine), an FPGA accelerator leveraging a new event-driven programming flow. We develop a mixed-arithmetic architecture, enabling GNN inference to be quantized at a node-level granularity. Finally, a prefetcher for data and instructions is implemented to optimize off-chip memory access and maximize node parallelism. Evaluation on citation and social media graph datasets ranging from $2$K to $700$K nodes showed a mean speedup of $243\times$ and $7.2\times$ against CPU and GPU counterparts, respectively.
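
To make the idea of node-level quantization more concrete, the following is a minimal sketch of how nodes might be tagged with a precision and grouped before dispatch. It is not the paper's implementation: the in-degree threshold rule, the `degree_threshold` value, the `"float32"`/`"int8"` tags, and the function names `assign_precisions` and `group_by_precision` are all assumptions made for illustration.

```python
from collections import Counter


def assign_precisions(edges, num_nodes, degree_threshold=2):
    """Map each node ID to a precision tag based on its in-degree.

    Assumption: nodes whose in-degree exceeds the threshold stay in
    floating point, all others are quantized. The rule and the tag
    names are hypothetical, not taken from the paper.
    """
    in_degree = Counter(dst for _, dst in edges)
    return {
        node: "float32" if in_degree[node] > degree_threshold else "int8"
        for node in range(num_nodes)
    }


def group_by_precision(precisions):
    """Group node IDs by precision tag so each group can be handled together."""
    groups = {}
    for node, tag in precisions.items():
        groups.setdefault(tag, []).append(node)
    return groups


# Toy graph as (source, destination) pairs: node 0 has in-degree 4,
# nodes 1 and 3 have in-degree 1, nodes 2 and 4 have in-degree 0.
edges = [(1, 0), (2, 0), (3, 0), (4, 0), (0, 1), (2, 3)]
precisions = assign_precisions(edges, num_nodes=5)
groups = group_by_precision(precisions)
```

In this toy graph only node 0 crosses the assumed threshold, so it lands in the floating-point group while the remaining nodes are grouped for low-precision processing.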

Paper Structure

This paper contains 17 sections, 6 equations, 4 figures, 5 tables, 1 algorithm.

Figures (4)

  • Figure 1: AMPLE Top Level Diagram. Packets propagate through dimension-order routing in the Aggregation Engine's Network-on-Chip (shown in green), and are driven diagonally into the Transformation Engine's systolic array (shown in red). Dashed lines represent control flow interfaces, while solid lines represent data flow between units. Node embeddings are fetched through HBM, while instructions are stored in DRAM.
  • Figure 2: Microarchitecture of AGE configured with three supported precisions. NID requests drive the Aggregation Managers (AGMs), which receive fetched embeddings from the Feature Bank (see Figure \ref{fig:fc_base_top}). These are then transferred to the Aggregation Cores (AGCs) through the network. Aggregation results are then buffered by the Buffering Managers (BMs).
  • Figure 3: Fetch Tags in the Feature Bank make concurrent memory access requests in a two-stage process: first, the list of neighbouring node IDs is stored in the Address Queue; these IDs are then used as pointers to the neighbouring feature embeddings, which are stored in the Message Queue.
  • Figure 4: Inference speedup relative to the Intel Xeon CPU baseline, obtained on the RTX A6000 GPU and in AMPLE simulation. The GPU shows an average speedup of $29.8\times$, $37.8\times$ and $26.7\times$ across all datasets for GCN, GIN and GraphSAGE, respectively. Equivalent speedups on AMPLE were $361.3\times$, $285.8\times$ and $81.7\times$.
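
The two-stage fetch described in the Figure 3 caption can be sketched in software as follows. This is a sequential model under stated assumptions, not the hardware design: the real Fetch Tags issue requests concurrently, and the dictionary-based `adjacency` and `embeddings` stores, the function names `two_stage_fetch` and `aggregate_sum`, and the choice of sum aggregation are all hypothetical.

```python
from collections import deque


def two_stage_fetch(node_id, adjacency, embeddings):
    """Model one Fetch Tag: gather neighbour IDs, then their embeddings."""
    # Stage 1: the list of neighbouring node IDs fills the Address Queue.
    address_queue = deque(adjacency[node_id])
    # Stage 2: each queued ID is used as a pointer to an embedding,
    # which is appended to the Message Queue.
    message_queue = deque()
    while address_queue:
        neighbour = address_queue.popleft()
        message_queue.append(embeddings[neighbour])
    return message_queue


def aggregate_sum(message_queue):
    """Element-wise sum over queued messages (one possible aggregator)."""
    return [sum(values) for values in zip(*message_queue)]


# Toy data: node 0 has neighbours 1 and 2, each with a 2-element embedding.
adjacency = {0: [1, 2], 1: [0], 2: [0]}
embeddings = {0: [1.0, 0.0], 1: [0.5, 2.0], 2: [1.5, 1.0]}
messages = two_stage_fetch(0, adjacency, embeddings)
result = aggregate_sum(messages)
```

Separating the ID fetch from the embedding fetch reflects the indirection in the caption: the second stage cannot start for a neighbour until the first stage has produced that neighbour's ID.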