Hardware-accelerated graph neural networks: an alternative approach for neuromorphic event-based audio classification and keyword spotting on SoC FPGA
Kamil Jeziorek, Piotr Wzorek, Krzysztof Blachut, Hiroshi Nakano, Manon Dampfhoffer, Thomas Mesquida, Hiroaki Nishi, Thomas Dalgaty, Tomasz Kryjak
TL;DR
This work presents the first hardware-accelerated event-graph neural network for time-series audio on a SoC FPGA, targeting low-latency, energy-efficient edge processing. It introduces a hardware-aware graph generator and an enhanced graph convolution with positional normalisation, enabling fully asynchronous event-by-event processing. The authors demonstrate end-to-end FPGA pipelines for both classification and keyword spotting (KWS), achieving state-of-the-art or competitive accuracy on SHD and SSC while using far fewer parameters and lower latency than prior FPGA SNNs, and they report the first hardware-accelerated KWS results on SSC. Compared with embedded GPUs, the FPGA solution delivers orders-of-magnitude speedups and favourable power characteristics, establishing a practical benchmark for near-sensor event-driven audio processing.
Abstract
As the volume of data recorded by embedded edge sensors increases, particularly from neuromorphic devices producing discrete event streams, there is a growing need for hardware-aware neural architectures that enable efficient, low-latency, and energy-conscious local processing. We present an FPGA implementation of event-graph neural networks for audio processing. We utilise an artificial cochlea that converts time-series signals into sparse event data, reducing memory and computation costs. Our architecture was implemented on a SoC FPGA and evaluated on two open-source datasets. For the classification task, our baseline floating-point model achieves 92.7% accuracy on the SHD dataset, only 2.4% below the state of the art, while requiring over 10x and 67x fewer parameters. On SSC, our models achieve between 66.9% and 71.0% accuracy. Compared to FPGA-based spiking neural networks, our quantised model reaches 92.3% accuracy, outperforming them by up to 19.3% while reducing resource usage and latency. For SSC, we report the first hardware-accelerated evaluation. We further demonstrate the first end-to-end FPGA implementation of event-audio keyword spotting (KWS), combining graph convolutional layers with recurrent sequence modelling. The system achieves up to 95% word-end detection accuracy, with a latency of only 10.53 microseconds and a power consumption of 1.18 W, establishing a strong benchmark for energy-efficient event-driven KWS.
