GraNNite: Enabling High-Performance Execution of Graph Neural Networks on Resource-Constrained Neural Processing Units

Arghadip Das, Shamik Kundu, Arnab Raha, Soumendu Ghosh, Deepak Mathaikutty, Vijay Raghunathan

TL;DR

GraNNite is a hardware-aware, three-step framework for running graph neural networks on resource-constrained NPUs at the edge. It first enables NPU execution (GraphSplit, StaGr, GrAd, NodePad), then optimizes compute and memory (EffOp, GraSp, PreG, SymG, CacheG), and finally trades accuracy for efficiency (QuantGr, GrAx1–GrAx3), targeting real-time edge inference. On Intel Core Ultra AI PCs it reports speedups of 2.6× to 7.6× over default NPU mappings and energy gains of up to 8.6× over CPUs and GPUs. No hardware changes are needed, and the approach is presented as generalizing to other models and accelerators, making edge GNN deployment more practical for tasks like RAG, event-driven vision, and streaming graph analytics.

Abstract

Graph Neural Networks (GNNs) are vital for learning from graph-structured data, enabling applications in network analysis, recommendation systems, and speech analytics. Deploying them on edge devices like client PCs and laptops enhances real-time processing, privacy, and cloud independence. GNNs aid Retrieval-Augmented Generation (RAG) for Large Language Models (LLMs) and enable event-based vision tasks. However, irregular memory access, sparsity, and dynamic structures cause high latency and energy overhead on resource-constrained devices. While modern edge processors integrate CPUs, GPUs, and NPUs, NPUs designed for data-parallel tasks struggle with irregular GNN computations. We introduce GraNNite, the first hardware-aware framework optimizing GNN execution on commercial-off-the-shelf (COTS) SOTA DNN accelerators via a structured three-step methodology: (1) enabling NPU execution, (2) optimizing performance, and (3) trading accuracy for efficiency gains. Step 1 employs GraphSplit for workload distribution and StaGr for static aggregation, while GrAd and NodePad handle dynamic graphs. Step 2 boosts performance using EffOp for control-heavy tasks and GraSp for sparsity exploitation. Graph Convolution optimizations PreG, SymG, and CacheG reduce redundancy and memory transfers. Step 3 balances quality versus efficiency, where QuantGr applies INT8 quantization, and GrAx1, GrAx2, and GrAx3 accelerate attention, broadcast-add, and SAGE-max aggregation. On Intel Core Ultra AI PCs, GraNNite achieves 2.6X to 7.6X speedups over default NPU mappings and up to 8.6X energy gains over CPUs and GPUs, delivering 10.8X and 6.7X higher performance than CPUs and GPUs, respectively, across GNN models.
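The abstract names the techniques without giving implementation detail. As a rough illustration of the general idea behind fixed-shape, dense aggregation on a data-parallel accelerator, the NumPy sketch below precomputes a normalized adjacency matrix, zero-pads it to a fixed node capacity, and runs one graph convolution as two dense matrix multiplies. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`normalized_adjacency`, `pad_features`, `gcn_layer`), the `capacity` parameter, and the choice of the standard symmetric GCN normalization are all hypothetical and chosen for illustration only.

```python
import numpy as np


def normalized_adjacency(edges, num_nodes, capacity):
    """Build a dense D^-1/2 (A + I) D^-1/2 matrix, zero-padded to capacity.

    Hypothetical helper: the fixed `capacity` stands in for a static
    tensor shape, so graphs of different sizes reuse one compiled shape.
    """
    if num_nodes > capacity:
        raise ValueError("graph exceeds padded capacity")
    a = np.zeros((capacity, capacity), dtype=np.float32)
    for u, v in edges:
        a[u, v] = 1.0
        a[v, u] = 1.0  # treat the graph as undirected
    idx = np.arange(num_nodes)
    a[idx, idx] = 1.0  # self-loops for real nodes only, not padding
    deg = a.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    mask = deg > 0
    inv_sqrt[mask] = 1.0 / np.sqrt(deg[mask])  # padded rows stay zero
    return a * inv_sqrt[:, None] * inv_sqrt[None, :]


def pad_features(x, capacity):
    """Zero-pad node features (num_nodes, f) to (capacity, f)."""
    padded = np.zeros((capacity, x.shape[1]), dtype=np.float32)
    padded[: x.shape[0]] = x
    return padded


def gcn_layer(a_hat, x, w):
    """One graph convolution as dense matmuls: ReLU(A_hat @ (X @ W))."""
    return np.maximum(a_hat @ (x @ w), 0.0)


if __name__ == "__main__":
    edges = [(0, 1), (1, 2)]
    a_hat = normalized_adjacency(edges, num_nodes=3, capacity=4)
    x = pad_features(np.ones((3, 2), dtype=np.float32), capacity=4)
    w = np.ones((2, 2), dtype=np.float32)
    print(gcn_layer(a_hat, x, w))
```

Because the padded rows and columns of the adjacency matrix are zero, padding nodes contribute nothing to real nodes and their own outputs stay zero, so the result for the real nodes matches the unpadded computation.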

Paper Structure

This paper contains 10 sections, 23 figures.

Figures (23)

  • Figure 1: Applications of GNNs on Client PCs: showcasing GNN-driven tasks like recommendations and event-driven vision, mapped onto Intel® Core™ Ultra processors for faster response and lower power.
  • Figure 2: Three fundamental GNNs: GCN, GAT, and GraphSAGE, emphasizing their unique approaches—convolutional aggregation, attention-based weighting, and neighbor sampling for scalability.
  • Figure 3: Execution flow of a GCN: graph preprocessing followed by iterative aggregation and combination phases for GNN computation [raha_book_chapter].
  • Figure 4: Execution Latency Breakdown of GraphConv and GraphAttn Layers (1433 input features and 64 output features) on Intel® Core™ Ultra Series 2 NPU across graph preprocessing (DPU/DSP) and GNN computation (DPU/DSP) for a graph with 1354 nodes and 5429 edges.
  • Figure 5: Execution latency breakdown of GNN computation of a single GraphConv and GraphAttn layer (1433 input features and 64 output features) on Intel® Core™ Ultra Series 2 NPU across operations [openvino_ops] for a graph with 1354 nodes and 5429 edges.
  • ...and 18 more figures