A Comparison of Temporal Encoders for Neuromorphic Keyword Spotting with Few Neurons

Mattias Nilsson, Ton Juny Pina, Lyes Khacef, Foteini Liwicki, Elisabetta Chicca, Fredrik Sandin

TL;DR

Two candidate neurocomputational elements for temporal encoding and feature extraction in SNNs described in recent literature are comparatively investigated in a keyword-spotting task on formants computed from spoken digits in the TIDIGITS dataset.

Abstract

With the expansion of AI-powered virtual assistants, there is a need for low-power keyword spotting systems providing a "wake-up" mechanism for subsequent computationally expensive speech recognition. One promising approach is the use of neuromorphic sensors and spiking neural networks (SNNs) implemented in neuromorphic processors for sparse event-driven sensing. However, this requires resource-efficient SNN mechanisms for temporal encoding, which need to consider that these systems process information in a streaming manner, with physical time being an intrinsic property of their operation. In this work, two candidate neurocomputational elements for temporal encoding and feature extraction in SNNs described in recent literature - the spiking time-difference encoder (TDE) and disynaptic excitatory-inhibitory (E-I) elements - are comparatively investigated in a keyword-spotting task on formants computed from spoken digits in the TIDIGITS dataset. While both encoders improve performance over direct classification of the formant features in the training data, enabling a complete binary classification with a logistic regression model, they show no clear improvements on the test set. Resource-efficient keyword spotting applications may benefit from the use of these encoders, but further work on methods for learning the time constants and weights is required to investigate their full potential.

Paper Structure (14 sections, 8 figures, 1 table)

Figures (8)

  • Figure 1: Spiking neural networks for keyword spotting with different temporal encoding layers. The linear classifiers (LCs) employ a logistic regression model, which is fitted using $\ell_2$ regularization.
  • Figure 2: Examples of formant spike-data. In each 1 ms time bin, one spike is generated on each of the four channels that correspond to the most active auditory frequency bands. Ten subsequent samples are illustrated for each of the specified digits. Green indicates samples for which true positive tests were generated by all single-neuron systems of Fig. \ref{fig:single_neuron}, while red indicates false negatives.
  • Figure 3: Basic principle of the time difference encoder (TDE). Adapted from angelo2020motion and gutierrez2022digital. (a) A TDE neurosynaptic unit. Input received by the trigger (trig) synapse is gated by the facilitatory (fac) synapse with a gain that depends on the temporal difference between the two inputs. (b)--(d) TDE responses for small, large, and negative time differences, respectively. (e) TDE time-difference spike-response curve.
  • Figure 4: Basic principle of disynaptic E--I elements. (a) One AdEx neuron receiving inputs through two different E--I elements, each consisting of one excitatory (Exc.) and one inhibitory (Inh.) dynamic synapse with different time constants. (b) Postsynaptic currents ($I_1$ and $I_2$) of the two E--I elements, which differ due to circuit inhomogeneity ("device mismatch"), resulting in different temporal delays and amplitudes of the postinhibitory excitations.
  • Figure 5: Permutation importance of neurons. The line styles correspond to the keywords "one" (dashed), "two" (dotted), and "three" (dash-dotted), and the corresponding plots are offset by x = 10 to improve visibility. The permutation importance was evaluated on the test set for a logistic regression model fitted on the training data from all layers of the neural network. Within each layer, the neurons are sorted by magnitude of permutation importance.
  • ...and 3 more figures
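
The TDE behaviour summarized in the Figure 3 caption can be illustrated with a minimal simulation sketch. This is not the authors' implementation: it assumes an exponentially decaying facilitatory gain, an exponentially decaying trigger current, and a leaky integrate-and-fire neuron, integrated with forward Euler. The function name `tde_spike_count` and all parameter values (`tau_fac`, `tau_trig`, `tau_mem`, `w`, `v_thresh`) are hypothetical and chosen only to make the qualitative effect visible.

```python
def tde_spike_count(dt_ms, tau_fac=10.0, tau_trig=5.0, tau_mem=10.0,
                    w=8.0, v_thresh=1.0, t_fac=5.0, sim_ms=100.0, step=0.1):
    """Count output spikes of a simplified TDE unit (illustrative assumptions only).

    dt_ms is the arrival time of the trigger spike minus that of the
    facilitatory spike; negative values mean the trigger arrives first.
    """
    n_steps = int(round(sim_ms / step))
    fac_step = int(round(t_fac / step))
    trig_step = int(round((t_fac + dt_ms) / step))

    gain = 0.0    # facilitatory trace that gates the trigger synapse
    i_syn = 0.0   # postsynaptic current produced by the trigger synapse
    v = 0.0       # membrane potential of the output neuron
    spikes = 0

    for k in range(n_steps):
        if k == fac_step:
            gain = 1.0                 # facilitatory spike opens the gate
        if k == trig_step:
            i_syn += w * gain          # trigger spike scaled by the current gain

        # Forward-Euler updates of the leaky membrane and the decaying traces
        v += step * (-v + i_syn) / tau_mem
        gain -= step * gain / tau_fac
        i_syn -= step * i_syn / tau_trig

        if v >= v_thresh:              # threshold crossing: emit spike and reset
            spikes += 1
            v = 0.0
    return spikes


if __name__ == "__main__":
    for dt in (-5.0, 1.0, 5.0, 10.0, 20.0):
        print(f"dt = {dt:5.1f} ms -> {tde_spike_count(dt)} output spikes")
```

Under these assumptions, a small positive time difference produces a burst of output spikes, a large one produces few or none, and a negative one produces none because the gain is still zero when the trigger arrives. This mirrors panels (b) to (d) of the figure qualitatively, not quantitatively.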