NeuraChip: Accelerating GNN Computations with a Hash-based Decoupled Spatial Accelerator

Kaustubh Shivdikar, Nicolas Bohm Agostini, Malith Jayaweera, Gilbert Jonatan, Jose L. Abellan, Ajay Joshi, John Kim, David Kaeli

TL;DR

NeuraChip tackles GNN scalability on ultra-sparse graphs by decoupling multiplication from accumulation and by introducing a sparsity-agnostic compute mapping, dynamic reseeding hash-based mapping (DRHM), together with a rolling eviction strategy that limits memory bloat. It implements a tiled, Gustavson-based SpGEMM on two distinct engines, NeuraCore for multiplication and NeuraMem for hash-based accumulation, connected by an on-chip dataflow, and it ships with NeuraSim, an open-source, cycle-accurate simulator for performance analysis. The authors report average speedups of 22.1x over Intel's MKL, 17.1x over NVIDIA's cuSPARSE, and 16.7x over AMD's hipSPARSE, plus 1.5x over a prior SpGEMM accelerator and 1.3x over a prior GNN accelerator, along with RTL-based area and power analysis. Together, these contributions advance practical GNN acceleration and provide a reproducible framework for evaluating sparse-matrix workloads in hardware.

Abstract

Graph Neural Networks (GNNs) are emerging as a formidable tool for processing non-Euclidean data across various domains, ranging from social network analysis to bioinformatics. Despite their effectiveness, their adoption has not been pervasive because of scalability challenges associated with large-scale graph datasets, particularly when leveraging message passing. To tackle these challenges, we introduce NeuraChip, a novel GNN spatial accelerator based on Gustavson's algorithm. NeuraChip decouples the multiplication and addition computations in sparse matrix multiplication. This separation allows for independent exploitation of their unique data dependencies, facilitating efficient resource allocation. We introduce a rolling eviction strategy to mitigate data idling in on-chip memory and to address the prevalent issue of memory bloat in sparse graph computations. Furthermore, load balancing of compute resources is achieved through a dynamic reseeding hash-based mapping, ensuring uniform utilization of computing resources regardless of sparsity patterns. Finally, we present NeuraSim, an open-source, cycle-accurate, multi-threaded, modular simulator for comprehensive performance analysis. Overall, NeuraChip presents a significant improvement, yielding an average speedup of 22.1x over Intel's MKL, 17.1x over NVIDIA's cuSPARSE, 16.7x over AMD's hipSPARSE, 1.5x over a prior state-of-the-art SpGEMM accelerator, and 1.3x over a prior GNN accelerator. The source code for our open-source simulator and performance visualizer is publicly accessible on GitHub via https://neurachip.us
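
The decoupling described in the abstract can be illustrated with a small software sketch. This is not NeuraChip's hardware design or NeuraSim code: it is a minimal Python model, assuming sparse matrices stored as nested dictionaries (row index to {column index: value}), in which a multiplication phase walks the inputs in Gustavson (row-wise) order and emits tagged partial products, and a separate accumulation phase merges them in a hash table keyed by output coordinate. The function names `multiply_phase`, `accumulate_phase`, and `spgemm` are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict


def multiply_phase(A, B):
    """Emit tagged partial products in Gustavson (row-wise) order.

    A and B are sparse matrices stored as {row: {col: value}}.
    Each yielded item is ((i, j), a_ik * b_kj); no addition happens here.
    """
    for i, row_a in A.items():
        for k, a_ik in row_a.items():
            # Row k of B is scaled by a_ik; skip it if B has no such row.
            for j, b_kj in B.get(k, {}).items():
                yield (i, j), a_ik * b_kj


def accumulate_phase(partials):
    """Merge partial products that share an output coordinate.

    A hash table keyed by (row, col) stands in for hash-based accumulation.
    """
    acc = defaultdict(float)
    for tag, value in partials:
        acc[tag] += value
    return dict(acc)


def spgemm(A, B):
    """Sparse matrix product C = A x B built from the two decoupled phases."""
    return accumulate_phase(multiply_phase(A, B))


# Small example: A is 2x3, B is 3x2, both sparse.
A = {0: {0: 1.0, 2: 2.0}, 1: {1: 3.0}}
B = {0: {1: 4.0}, 1: {0: 5.0, 1: 6.0}, 2: {1: 7.0}}
C = spgemm(A, B)
print(C)
```

Because the multiplication phase only produces tagged values and the accumulation phase only consumes them, the two can be scheduled independently, which is the property the abstract attributes to the decoupled design. The sketch leaves out tiling, rolling eviction, and the reseeding hash mapping.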

Paper Structure (29 sections, 4 equations, 17 figures, 5 tables, 2 algorithms)


Figures (17)

  • Figure 1: NeuraChip overview: (a) Aggregation phase of GCN, (b) NeuraCore generates partial products, (c) NeuraMem accumulates partial products, (d) writes back to HBM.
  • Figure 2: Matrix multiplication approaches, each showcasing varying degrees of data reuse for input and output matrices.
  • Figure 3: Multiplication and Accumulation phase techniques.
  • Figure 4: Implementation of tiled Gustavson's algorithm using NeuraCore for multiplication and NeuraMem for accumulation.
  • Figure 5: NeuraChip Architecture: Tile 64 configuration with 16 NeuraCores and 16 NeuraMems per tile.
  • ...and 12 more figures