AMPLE: Event-Driven Accelerator for Mixed-Precision Inference of Graph Neural Networks
Pedro Gimenes, Yiren Zhao, George Constantinides
TL;DR
The paper tackles the irregular memory access and workload imbalance that arise in graph neural network inference on large, sparse graphs. It introduces AMPLE, an FPGA accelerator that uses an event-driven host-programmable flow, a heterogeneous on-chip network with dynamic resource allocation, and a node-centric, mixed-precision approach to inference. A node-level quantization strategy (DegreeQuant) paired with a node-centric prefetcher enables scalable, memory-efficient processing without storing the embeddings of the entire graph on-chip. On graphs ranging from $2\text{K}$ to $700\text{K}$ nodes, AMPLE achieves mean speedups of $243\times$ over CPU and $7.2\times$ over GPU baselines, demonstrating its potential for practical GNN acceleration on large graphs.
Abstract
Graph Neural Networks (GNNs) have recently gained attention due to their performance on non-Euclidean data. Custom hardware architectures are particularly beneficial for GNNs because of their irregular memory access patterns, which result from the sparse structure of graphs. However, existing FPGA accelerators are limited by their double buffering mechanism, which does not account for the irregular node distribution in typical graph datasets. To address this, we introduce \textbf{AMPLE} (Accelerated Message Passing Logic Engine), an FPGA accelerator leveraging a new event-driven programming flow. We develop a mixed-arithmetic architecture, enabling GNN inference to be quantized at a node-level granularity. Finally, a prefetcher for data and instructions is implemented to optimize off-chip memory access and maximize node parallelism. Evaluation on citation and social media graph datasets ranging from $2$K to $700$K nodes showed a mean speedup of $243\times$ and $7.2\times$ against CPU and GPU counterparts, respectively.
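To make the idea of node-level quantization concrete, the following is a minimal software sketch, not the AMPLE hardware or the authors' implementation. It assumes a simple rule in which nodes whose in-degree exceeds a threshold keep floating-point embeddings while all other nodes are quantized to symmetric int8. The threshold, the tagging rule, and all function names (`assign_precision`, `quantize_int8`, `mixed_precision_embeddings`) are hypothetical and chosen only for illustration.

```python
def assign_precision(in_degrees, threshold):
    """Tag each node with a precision (assumed rule: high in-degree -> float)."""
    return ["float" if d > threshold else "int8" for d in in_degrees]


def quantize_int8(vec):
    """Symmetric per-node int8 quantization; returns (integer codes, scale)."""
    max_abs = max((abs(x) for x in vec), default=0.0)
    if max_abs == 0.0:
        # An all-zero (or empty) embedding needs no scaling.
        return [0] * len(vec), 1.0
    scale = max_abs / 127.0
    codes = [max(-127, min(127, round(x / scale))) for x in vec]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


def mixed_precision_embeddings(embeddings, in_degrees, threshold):
    """Apply the per-node precision tag to every node embedding."""
    tags = assign_precision(in_degrees, threshold)
    result = []
    for vec, tag in zip(embeddings, tags):
        if tag == "float":
            result.append(list(vec))  # kept at full precision
        else:
            codes, scale = quantize_int8(vec)
            result.append(dequantize(codes, scale))  # int8 round trip
    return tags, result


if __name__ == "__main__":
    embeddings = [[1.0, 0.25, 0.0], [0.3, -0.7, 0.9], [2.0, -1.0, 0.5]]
    in_degrees = [1, 50, 3]
    tags, approx = mixed_precision_embeddings(embeddings, in_degrees, threshold=10)
    print(tags)
    print(approx)
```

In this sketch the precision decision is made per node rather than per layer or per tensor, which is the granularity the abstract describes; how an accelerator schedules the differently tagged nodes onto its arithmetic units is outside the scope of the example.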
