Towards efficient keyword spotting using spike-based time difference encoders

Alejandro Pequeño-Zurro, Lyes Khacef, Stefano Panzeri, Elisabetta Chicca

TL;DR

The paper investigates the Temporal Difference Encoder (TDE) as an efficient neuron model for keyword spotting on neuromorphic hardware. Comparing a TDE-based network with feedforward and recurrent CuBa-LIF variants in a three-layer SNN, the study shows that spike timing carries substantial discriminative information in formant-based speech, and that the feedforward TDE network reaches accuracy close to the recurrent CuBa-LIF network (89% versus 91%) while performing 92% fewer synaptic operations. Data-driven pruning further reduces network size with minimal accuracy loss, and TDE networks show favorable training efficiency and interpretable features that map to frequency pairs and timescales. These findings suggest that the TDE offers a scalable, energy-efficient approach to event-driven spatio-temporal pattern processing for speech recognition at the edge.

Abstract

Keyword spotting in edge devices is becoming increasingly important as voice-activated assistants are widely used. However, its deployment is often limited by the extreme low-power constraints of the target embedded systems. Here, we explore the performance of the Temporal Difference Encoder (TDE) in keyword spotting. This recent neuron model encodes the time difference in instantaneous frequency and spike count to perform efficient keyword spotting with neuromorphic processors. We use the TIdigits dataset of spoken digits with a formant decomposition and rate-based encoding into spikes. We compare three Spiking Neural Network (SNN) architectures to learn and classify spatio-temporal signals. The proposed SNN architectures are made of three layers and differ in their hidden layer, which is composed of either (1) feedforward TDE, (2) feedforward Current-Based Leaky Integrate-and-Fire (CuBa-LIF), or (3) recurrent CuBa-LIF neurons. We first show that the spike trains of the frequency-converted spoken digits have a large amount of information in the temporal domain, reinforcing the importance of better exploiting temporal encoding for such a task. We then train the three SNNs with the same number of synaptic weights to quantify and compare their performance based on accuracy and synaptic operations. The resulting accuracy of the feedforward TDE network (89%) is higher than that of the feedforward CuBa-LIF network (71%) and close to that of the recurrent CuBa-LIF network (91%). However, the feedforward TDE-based network performs 92% fewer synaptic operations than the recurrent CuBa-LIF network with the same number of synapses. In addition, the results of the TDE network are highly interpretable and correlated with the frequency and timescale features of the spoken keywords in the dataset. Our findings suggest that the TDE is a promising neuron model for scalable event-driven processing of spatio-temporal patterns.
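
To make the abstract's description of the TDE more concrete, the following is a minimal discrete-time sketch in Python. It follows the general TDE idea found in the literature (a facilitatory input opens a decaying gain window, and a later trigger input produces a current scaled by the remaining gain, which a LIF neuron turns into a burst). It is not the authors' implementation: the function name `simulate_tde`, all parameter names and values, the reset of the gain to 1.0, and the reset of the membrane to zero are assumptions chosen for illustration.

```python
import numpy as np

def simulate_tde(fac_spikes, trig_spikes, dt=1e-3, tau_gain=20e-3,
                 tau_epsc=10e-3, tau_mem=10e-3, w=1.0, v_thresh=1.0):
    """Illustrative TDE sketch (hypothetical parameters, not from the paper).

    fac_spikes, trig_spikes: 0/1 arrays of equal length, one entry per time step.
    Returns a 0/1 array of output spikes.
    """
    # Per-step exponential decay factors for the three state variables.
    a_gain = np.exp(-dt / tau_gain)
    a_epsc = np.exp(-dt / tau_epsc)
    a_mem = np.exp(-dt / tau_mem)

    gain = epsc = v = 0.0
    out = np.zeros(len(fac_spikes), dtype=int)
    for t in range(len(fac_spikes)):
        gain *= a_gain
        epsc *= a_epsc
        if fac_spikes[t]:
            gain = 1.0           # assumption: facilitatory spike saturates the gain
        if trig_spikes[t]:
            epsc += w * gain     # trigger spike samples the remaining gain
        v = a_mem * v + epsc     # current-based leaky integration
        if v >= v_thresh:
            out[t] = 1
            v = 0.0              # assumption: reset to zero after a spike
    return out

# Usage: the same facilitatory spike followed by a trigger after 2 ms or 40 ms.
n_steps = 200
fac = np.zeros(n_steps, dtype=int)
fac[10] = 1
for delay in (2, 40):
    trig = np.zeros(n_steps, dtype=int)
    trig[10 + delay] = 1
    print(delay, "ms delay ->", simulate_tde(fac, trig).sum(), "output spikes")
```

With these illustrative defaults, a short delay between the two inputs yields a burst of output spikes, while a long delay yields none, and a trigger that arrives before the facilitatory spike produces no output at all. This is the sense in which the time difference is expressed through spike count and instantaneous frequency.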

Paper Structure

This paper contains 16 sections, 12 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: System pipeline. Starting from the speech signal, an offline formant decomposition extracts the frequency and amplitude of the three formants with the highest energy. Frequency is represented spatially across the input neurons, and formant amplitude is encoded into spikes in L0. The network's parameters are then optimized for performance in the classification task.
  • Figure 2: Keyword categories decomposed into formant frequency bands are transformed into spikes in the encoding layer of the networks.
  • Figure 3: Network architectures compared throughout the manuscript.
  • Figure 4: Temporal Difference Encoder in the presence of input spikes.
  • Figure 5: Information that the spike pattern and the spike rate of the encoded spikes carry about the dataset. The left panel illustrates how I_pattern and I_rate are calculated from the spike trains. $\Delta t$ is the time bin that defines the precision of the temporal pattern code. Increasing the time bin reduces the length of the pattern for a fixed window of 400 ms from the first spike (this applies only to this metric). The right panel shows the information carried by the rate against that carried by the pattern of spikes in the encoded dataset. Spike pattern information is maximized at a precision of around 60 ms. Mean and standard deviation are shown for each of the individual frequency channels that serve as input to the networks.
  • ...and 6 more figures
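
The Figure 5 caption describes a rate code and a pattern code read out from the same spike train over a 400 ms window that starts at the first spike. The sketch below shows one way such codes could be built and compared, under stated assumptions: the pattern is taken to be a binary word with one bit per time bin $\Delta t$, the rate is the spike count in the window, and information is estimated with a plain plug-in mutual information estimator without any bias correction. The function names `rate_and_pattern` and `plugin_mutual_information` are hypothetical, and the paper's actual estimator may differ.

```python
from collections import Counter

import numpy as np

def rate_and_pattern(spike_times_ms, bin_ms, window_ms=400.0):
    """Return (spike count, binary pattern word) for one spike train.

    The window starts at the first spike; a larger bin_ms gives a shorter word.
    """
    times = np.asarray(spike_times_ms, dtype=float)
    n_bins = int(np.ceil(window_ms / bin_ms))
    if times.size == 0:
        return 0, (0,) * n_bins
    rel = times - times.min()        # align to the first spike
    rel = rel[rel < window_ms]       # keep spikes inside the window
    word = np.zeros(n_bins, dtype=int)
    word[(rel // bin_ms).astype(int)] = 1   # assumption: binary bins
    return int(rel.size), tuple(int(b) for b in word)

def plugin_mutual_information(symbols, labels):
    """Plug-in estimate of I(symbol; label) in bits (no bias correction)."""
    n = len(symbols)
    joint = Counter(zip(symbols, labels))
    p_sym = Counter(symbols)
    p_lab = Counter(labels)
    return sum(
        (c / n) * np.log2(c * n / (p_sym[s] * p_lab[l]))
        for (s, l), c in joint.items()
    )

# Usage on made-up spike times (ms) for a single input channel.
count, word = rate_and_pattern([100, 130, 250, 700], bin_ms=60)
print(count, word)
```

Applying `plugin_mutual_information` once to the counts and once to the words, with the keyword class as the label, gives rough analogues of I_rate and I_pattern for a chosen $\Delta t$. Plug-in estimates are biased upward when few trials are available, so they should be read only as a qualitative illustration.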