
Neurobench: DCASE 2020 Acoustic Scene Classification benchmark on XyloAudio 2

Weijie Ke, Mina Khoei, Dylan Muir

TL;DR

This report describes the benchmark dataset, the audio preprocessing approach, and the network architecture and training approach. It then presents the performance of the trained model and the power and latency measurements performed on the XyloAudio 2 development kit.

Abstract

XyloAudio is a line of ultra-low-power audio inference chips, designed for in- and near-microphone analysis of audio in real-time, energy-constrained scenarios. Xylo is designed around a highly efficient integer-logic processor which simulates parameter- and activity-sparse spiking neural networks (SNNs) using a leaky integrate-and-fire (LIF) neuron model. Neurons on Xylo are quantised integer devices operating in synchronous digital CMOS, with neuron and synapse state quantised to 16 bits, and weight parameters quantised to 8 bits. Xylo is tailored for real-time streaming operation, as opposed to the accelerated-time operation of an inference accelerator. XyloAudio includes a low-power audio encoding interface for direct connection to a microphone, designed for sparse encoding of incident audio for further processing by the inference core. In this report we present results for the DCASE 2020 acoustic scene classification audio benchmark deployed to XyloAudio 2. We describe the benchmark dataset, the audio preprocessing approach, and the network architecture and training approach. We present the performance of the trained model, and the results of power and latency measurements performed on the XyloAudio 2 development kit. This benchmark is conducted as part of the Neurobench project.
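To make the quantisation described in the abstract concrete, the following is a minimal sketch of a single integer LIF neuron with 16-bit state and 8-bit weights. It is not the Xylo implementation: the bit-shift leak, the subtractive reset, and the names `lif_step`, `dash_syn` and `dash_mem` are assumptions chosen for illustration, since the abstract specifies only the bit widths and the LIF model.

```python
# Illustrative integer LIF neuron. The bit widths follow the abstract;
# the leak and reset rules below are assumptions, not the Xylo design.
STATE_MIN, STATE_MAX = -(2**15), 2**15 - 1  # 16-bit signed neuron/synapse state
WEIGHT_MIN, WEIGHT_MAX = -(2**7), 2**7 - 1  # 8-bit signed weights


def clip(value, lo, hi):
    """Saturate an integer to the range [lo, hi]."""
    return max(lo, min(hi, value))


def lif_step(v_mem, i_syn, in_spikes, weights, threshold, dash_syn=2, dash_mem=4):
    """Advance one neuron by one time step.

    Returns the updated (v_mem, i_syn, spiked) tuple.
    """
    # Sum the weights of the inputs that carried an event on this step.
    drive = sum(clip(w, WEIGHT_MIN, WEIGHT_MAX) * s for w, s in zip(weights, in_spikes))
    # Assumed leak: subtract a right-shifted copy of the state (integer-only decay).
    i_syn = clip(i_syn - (i_syn >> dash_syn) + drive, STATE_MIN, STATE_MAX)
    v_mem = clip(v_mem - (v_mem >> dash_mem) + i_syn, STATE_MIN, STATE_MAX)
    spiked = v_mem >= threshold
    if spiked:
        v_mem -= threshold  # assumed subtractive reset
    return v_mem, i_syn, spiked


# Example: drive the neuron with one input event pattern, then let it decay.
v, i = 0, 0
for events in ([1, 0, 1], [0, 0, 0]):
    v, i, spiked = lif_step(v, i, events, [50, -20, 30], threshold=100)
    print(v, i, spiked)
```

Because every operation is an integer add, shift, or compare, a neuron of this kind needs no multiplier for its dynamics, which is one plausible reason integer LIF models suit low-power digital logic.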

Paper Structure

This paper contains 2 sections, 2 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Audio preprocessing approach [bos2024micropower]. (a) The stages of audio preprocessing in Xylo™Audio 2. Single-channel audio arrives at a microphone (b). This passes through a band-pass Butterworth filterbank, and is split into $N=16$ frequency bands (c). Filter output is rectified (d) before passing through a bank of LIF neurons that smooth and quantize the signals in each band. The result is a set of sparse event channels (e), where the firing intensity in each channel is proportional to the instantaneous energy in each frequency band.
  • Figure 2: The SynNet architecture used in this benchmark [bos_sub-mw_2022, bos2024micropower]. Event-encoded audio is provided as input, as described in Figure 1. The network consists of a single feed-forward chain of fully-connected layers, using the LIF neuron model. Several time constants are distributed over each layer, with shorter time constants in early layers and longer time constants in later layers (see text for details). Four readout LIF neurons are used in each network.
  • Figure 3: The Xylo™Audio 2 hardware development kit (HDK). The HDK is a USB bus-powered board requiring a host PC for power and interfacing. The HDK interfaces with the open-source Rockpool toolchain for deployment and testing. An analog microphone and an analog jack are provided for direct analog single-channel differential input. Encoded audio data can alternatively be streamed from the host PC. Inference is performed on the Xylo device (red outline).
  • Figure 4: Benchmarking system overview. The Xylo development kit ("Xylo HDK", green) is connected via a USB cable to a PC. In dataset-driven inference mode, a simulation of the audio encoding block is used to encode audio samples to events ("AFESim"). These encoded samples are streamed to the SNN inference core in the Xylo™Audio 2 device (solid lines). Inference is performed entirely on the SNN inference core, and the readout events are returned to the PC via USB (solid lines). The PC performs an $\textrm{argmax}$ to obtain the inferred class. An FPGA on the development kit manages configuration and power measurement. In live streaming mode, audio is recorded by a microphone on the development kit and sent to the Audio Front-End (AFE) core in the Xylo™Audio 2 device.
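
The rectify-and-encode stage in the Figure 1 caption and the host-side $\textrm{argmax}$ in the Figure 4 caption can be sketched in a few lines of plain Python. This is a simplified illustration and not the AFESim or Rockpool code. The function names `encode_band` and `infer_class`, the floating-point state, the full-wave rectification, the `leak` and `threshold` values, and the summing of readout events over time before the argmax are all assumptions made for this example. The band-pass filterbank is omitted, and the input is taken to be the output of one filter.

```python
def encode_band(samples, threshold=1.0, leak=0.9):
    """Turn one band-pass filter output into a per-step event count.

    The signal is rectified, accumulated in a leaky integrator, and an event
    is emitted each time the accumulated value crosses `threshold`.
    """
    state = 0.0
    events = []
    for x in samples:
        state = leak * state + abs(x)  # assumed full-wave rectification + leak
        n = int(state // threshold)    # number of events emitted on this step
        state -= n * threshold         # subtractive reset keeps the remainder
        events.append(n)
    return events


def infer_class(readout_events):
    """Pick the class whose readout neuron emitted the most events.

    `readout_events` is a list of time steps, each a list with one event
    count per readout neuron. Ties resolve to the lowest class index.
    """
    n_classes = len(readout_events[0])
    totals = [sum(step[c] for step in readout_events) for c in range(n_classes)]
    return max(range(n_classes), key=totals.__getitem__)


# Example: a louder band produces a denser event train than a quiet one.
loud = encode_band([2.0] * 10)
quiet = encode_band([0.2] * 10)
print(sum(loud), sum(quiet))

# Example: four readout neurons observed over four time steps.
print(infer_class([[0, 1, 0, 0], [0, 1, 1, 0], [1, 0, 1, 0], [0, 0, 1, 0]]))
```

In this sketch the event rate in a band rises with the rectified signal amplitude, which mirrors the caption's statement that firing intensity tracks the energy in each frequency band.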