
Micro-power spoken keyword spotting on Xylo Audio 2

Hannah Bos, Dylan R. Muir

TL;DR

This paper implements the spoken audio keyword-spotting benchmark "Aloha" on the Xylo Audio 2 Neuromorphic processor device. The results show that Neuromorphic designs are well-suited for real-time near- and in-sensor processing on edge devices.

Abstract

For many years, designs for "Neuromorphic" or brain-like processors have been motivated by achieving extreme energy efficiency, compared with von-Neumann and tensor processor devices. As part of their design language, Neuromorphic processors take advantage of weight, parameter, state and activity sparsity. In the extreme case, neural networks based on these principles mimic the sparse activity of biological nervous systems, in "Spiking Neural Networks" (SNNs). Few benchmarks are available for Neuromorphic processors that have been implemented on a range of Neuromorphic and non-Neuromorphic platforms, and that can therefore demonstrate the energy benefits of Neuromorphic processor designs. Here we describe the implementation of a spoken audio keyword-spotting (KWS) benchmark, "Aloha", on the Xylo Audio 2 (SYNS61210) Neuromorphic processor device. We obtained high deployed quantized task accuracy (95%), exceeding the benchmark task accuracy. We measured real continuous power of the deployed application on Xylo. We obtained best-in-class dynamic inference power (291 μW) and best-in-class inference efficiency (6.6 μJ / Inf). Xylo sets a new minimum power for the Aloha KWS benchmark, and highlights the extreme energy efficiency achievable with Neuromorphic processor designs. Our results show that Neuromorphic designs are well-suited for real-time near- and in-sensor processing on edge devices.

Paper Structure (8 sections, 1 equation, 6 figures, 2 tables)

This paper contains 8 sections, 1 equation, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Distribution of sample durations in the Aloha dataset. In this work we pad and clip samples to a uniform 3s duration (dashed line). This retains the majority of data in both train and test datasets.
  • Figure 2: Audio preprocessing approach. (a) The stages of audio preprocessing in Xylo Audio 2. Single-channel audio arrives at a microphone (b). This passes through a band-pass Butterworth filterbank and is split into $N=16$ frequency bands (c). The filter output is rectified (d) before passing through a bank of LIF neurons that smooth and quantize the signals in each band. The result is a set of sparse event channels (e), where the firing intensity in each channel is proportional to the instantaneous energy in each frequency band.
  • Figure 3: The SynNet architecture used in this benchmark. Event-encoded audio is provided as input, as described in Figure 2. The network consists of a single feed-forward chain of fully-connected layers, using the LIF neuron model. Several time constants are distributed over each layer, with shorter time constants in early layers and longer time constants in later layers (see text for details). A single readout LIF neuron is used in each network.
  • Figure 4: ROC curves for the trained models listed in the model performance table. (a) True Positive Rate vs False Positive Rate curves. (b) Accuracy of each model as the threshold of the readout neuron is varied.
  • Figure 5: The Xylo™ Audio 2 hardware development kit (HDK). The HDK is a USB bus-powered board that requires a host PC for power and interfacing. The HDK interfaces with the open-source Rockpool toolchain for deployment and testing. An analog microphone and an analog jack are provided for direct analog single-channel differential input. Encoded audio data can alternatively be streamed from the host PC. Inference is performed on the Xylo device (red outline).
  • ...and 1 more figure
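
The captions for Figures 1 and 2 describe a preprocessing chain: fix each sample to 3 s, split the audio into frequency bands, rectify, and convert each band into events with an LIF neuron. The following is a minimal Python sketch of that chain, not the Xylo Audio 2 implementation. Several details are assumptions made for illustration only: the function names, the second-order band-pass biquad used as a stand-in for the Butterworth filters, full-wave rectification, the subtractive reset, and all parameter values (`q`, `tau_s`, `threshold`, and the centre frequencies passed in by the caller).

```python
import math


def pad_or_clip(samples, sample_rate, duration_s=3.0):
    """Zero-pad or truncate a list of samples to a fixed duration."""
    target = int(round(duration_s * sample_rate))
    clipped = list(samples[:target])
    return clipped + [0.0] * (target - len(clipped))


def bandpass_biquad(samples, sample_rate, centre_hz, q=4.0):
    """Second-order band-pass filter with unit gain at the centre frequency.

    Assumption: this simple biquad stands in for one band of the
    Butterworth filterbank named in the Figure 2 caption.
    """
    w0 = 2.0 * math.pi * centre_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b0, b2 = alpha / a0, -alpha / a0  # b1 is zero for this filter form
    a1, a2 = -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out


def lif_encode(signal, sample_rate, tau_s=0.01, threshold=1.0):
    """Rectify a band signal and convert it to a 0/1 event train.

    A leaky integrator accumulates the rectified input; each time the
    state reaches the threshold, one event is emitted and the threshold
    is subtracted (assumed reset rule).
    """
    decay = math.exp(-1.0 / (tau_s * sample_rate))
    v = 0.0
    events = []
    for x in signal:
        v = v * decay + abs(x)  # full-wave rectification (assumption)
        if v >= threshold:
            events.append(1)
            v -= threshold
        else:
            events.append(0)
    return events


def encode(samples, sample_rate, centres_hz):
    """Return one event train per frequency band for a fixed 3 s window."""
    fixed = pad_or_clip(samples, sample_rate)
    return [
        lif_encode(bandpass_biquad(fixed, sample_rate, f), sample_rate)
        for f in centres_hz
    ]
```

Calling `encode(samples, 16000, centres)` with 16 centre frequencies would give 16 event channels, matching the $N=16$ bands in the caption. In this sketch the event rate in a channel grows with the rectified amplitude in that band, which is the qualitative behaviour the Figure 2 caption describes.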