Hardware-accelerated graph neural networks: an alternative approach for neuromorphic event-based audio classification and keyword spotting on SoC FPGA

Kamil Jeziorek, Piotr Wzorek, Krzysztof Blachut, Hiroshi Nakano, Manon Dampfhoffer, Thomas Mesquida, Hiroaki Nishi, Thomas Dalgaty, Tomasz Kryjak

TL;DR

This work presents the first hardware-accelerated event-graph neural network for time-series audio on a SoC FPGA, targeting low-latency, energy-efficient edge processing. It introduces a hardware-aware graph generator and an enhanced graph convolution with positional normalisation, enabling fully asynchronous event-by-event processing. The authors demonstrate end-to-end FPGA pipelines for both classification and keyword spotting, achieving state-of-the-art or competitive accuracy on SHD and SSC while using far fewer parameters and lower latency than prior FPGA SNNs, and they report the first hardware-accelerated KWS results on SSC. Compared with embedded GPUs, the FPGA solution delivers orders-of-magnitude speedups and favorable power characteristics, establishing a practical benchmark for near-sensor event-driven audio processing.

Abstract

As the volume of data recorded by embedded edge sensors increases, particularly from neuromorphic devices producing discrete event streams, there is a growing need for hardware-aware neural architectures that enable efficient, low-latency, and energy-conscious local processing. We present an FPGA implementation of event-graph neural networks for audio processing. We utilise an artificial cochlea that converts time-series signals into sparse event data, reducing memory and computation costs. Our architecture was implemented on a SoC FPGA and evaluated on two open-source datasets. For the classification task, our baseline floating-point model achieves 92.7% accuracy on the SHD dataset, only 2.4% below the state of the art, while requiring over 10x and 67x fewer parameters. On SSC, our models achieve 66.9-71.0% accuracy. Compared to FPGA-based spiking neural networks, our quantised model reaches 92.3% accuracy, outperforming them by up to 19.3% while reducing resource usage and latency. For SSC, we report the first hardware-accelerated evaluation. We further demonstrate the first end-to-end FPGA implementation of event-audio keyword spotting, combining graph convolutional layers with recurrent sequence modelling. The system achieves up to 95% word-end detection accuracy, with a latency of only 10.53 microseconds and a power consumption of 1.18 W, establishing a strong benchmark for energy-efficient event-driven KWS.

Paper Structure

This paper contains 30 sections, 10 equations, 9 figures, and 8 tables.

Figures (9)

  • Figure 1: In this work, we propose an event-based keyword spotting system in which speech signals are converted into asynchronous events by an artificial cochlea and represented as spectro-temporal event-graphs. These are processed by a GCN–RNN model deployed on a SoC FPGA, enabling low-power, low-latency, and efficient keyword spotting.
  • Figure 2: Overview of our hardware-accelerated event-graph neural network implementation. Events from the artificial cochlea are first read and preprocessed in the processing system (PS), and subsequently processed by an asynchronous graph neural network in the programmable logic (PL). Modules marked with dotted lines operate on an event-by-event basis. For the classification task, we employ a global average pooling layer, which updates an accumulator over the entire sample and returns the result to a fully connected classifier (MLP). For keyword spotting, the data is divided into smaller time windows ($\Delta t$), aggregated using a graph max pooling operation, and processed sequentially by two MLPs and a GRU, after which confidence and class scores are computed. In both tasks, we employ the same feature extractor, consisting of a graph generator followed by four graph convolution layers.
  • Figure 3: Diagram of the graph generation hardware module, which takes events from a FIFO as input and outputs each vertex's features along with the list of its edges. The blue colour indicates the time (measured in clock cycles) required for the individual steps.
  • Figure 4: Diagram of the graph convolution hardware module. The arrows indicate the dataflow in the module, while the blue text indicates the time (measured in clock cycles) required for the individual steps.
  • Figure 5: Diagram of the head of the hardware KWS module. The key part of this module is a vector multiplication unit shared between consecutive MLPs; its utilisation is controlled by a state machine.
  • ...and 4 more figures
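
The two pooling heads described in the Figure 2 caption can be illustrated with a small software sketch. This is not the authors' hardware design: the class and function names, the plain-list feature vectors, and the choice to skip empty time windows are all assumptions made for illustration. It only shows the general idea of updating a running accumulator per vertex (classification) and taking an element-wise maximum per time window of length `window` (keyword spotting).

```python
from typing import Iterable, List, Sequence, Tuple


class RunningAveragePool:
    """Global average pooling updated one vertex at a time (hypothetical helper)."""

    def __init__(self, num_features: int) -> None:
        self.total = [0.0] * num_features  # per-feature running sum
        self.count = 0                     # number of vertices seen so far

    def update(self, features: Sequence[float]) -> None:
        # Accumulate the features of a single incoming vertex.
        for i, value in enumerate(features):
            self.total[i] += value
        self.count += 1

    def result(self) -> List[float]:
        # Divide once at the end of the sample; an empty sample yields zeros.
        if self.count == 0:
            return list(self.total)
        return [s / self.count for s in self.total]


def windowed_max_pool(
    events: Iterable[Tuple[float, Sequence[float]]], window: float
) -> List[Tuple[int, List[float]]]:
    """Element-wise max of vertex features inside each time window.

    `events` must be sorted by timestamp. Windows without any event are
    skipped, which is an assumption of this sketch.
    """
    pooled_windows: List[Tuple[int, List[float]]] = []
    current_index = None
    pooled: List[float] = []
    for timestamp, features in events:
        index = int(timestamp // window)
        if index != current_index:
            # A new window starts: emit the previous one, if any.
            if current_index is not None:
                pooled_windows.append((current_index, pooled))
            current_index = index
            pooled = list(features)
        else:
            pooled = [max(a, b) for a, b in zip(pooled, features)]
    if current_index is not None:
        pooled_windows.append((current_index, pooled))
    return pooled_windows


# Toy input: (timestamp, feature vector) pairs with made-up values.
toy_events = [(0.0, [1.0, 4.0]), (0.5, [3.0, 2.0]), (1.2, [5.0, 0.0])]

average_pool = RunningAveragePool(num_features=2)
for _, vertex_features in toy_events:
    average_pool.update(vertex_features)

print(average_pool.result())                      # average over the whole sample
print(windowed_max_pool(toy_events, window=1.0))  # one pooled vector per window
```

In this sketch the average is stored as a sum and a count, so each incoming vertex costs one addition per feature and the division happens only once per sample. The windowed maximum keeps a single vector per open window, which is what allows the downstream sequence model to consume one pooled vector per time step.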