Temporal Information Reconstruction and Non-Aligned Residual in Spiking Neural Networks for Speech Classification

Qi Zhang, Huamin Wang, Hangchi Shen, Shukai Duan, Shiping Wen, Tingwen Huang

TL;DR

The paper tackles the limitation of single-temporal-resolution processing in spiking neural networks for speech classification by introducing Temporal Reconstruction (TR), which reconstructs the temporal dimension to capture information at multiple time scales, and Non-Aligned Residual (NAR), which enables residual connections between sequences of different lengths. Integrated into an SNN-Delays baseline, TR and NAR enable multi-time-scale representations and flexible residual learning, yielding state-of-the-art results on the spike-based SSC (81.02% test accuracy) and SHD (96.04% test accuracy) datasets, while also improving throughput and energy efficiency on non-spike data. These contributions advance energy-efficient, multi-scale temporal processing in neuromorphic speech systems, with practical implications for robust real-time audio classification and neuromorphic hardware applications.

Abstract

Most models based on spiking neural networks (SNNs) handle speech classification at a single temporal resolution, which prevents them from learning information in the input data at different temporal scales. In addition, because the time length of the data differs before and after the sub-modules of many models, effective residual connections cannot be applied to optimize their training. To address these problems, we first reconstruct the temporal dimension of the audio spectrum and propose a novel method named Temporal Reconstruction (TR), inspired by the hierarchical way the human brain processes speech. An SNN model reconstructed with TR can learn information in the input data at different temporal resolutions and can therefore model more comprehensive semantic information from audio data. Second, by analyzing the audio data, we propose the Non-Aligned Residual (NAR) method, which allows residual connections to be used between two audio sequences with different time lengths. We have conducted extensive experiments on the Spiking Speech Commands (SSC), Spiking Heidelberg Digits (SHD), and Google Speech Commands v0.02 (GSC) datasets. On SSC, we achieve a state-of-the-art (SOTA) test classification accuracy of 81.02% among all SNN models, and on SHD we obtain a SOTA classification accuracy of 96.04% among all models.
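
To make the idea of regrouping the temporal dimension concrete, the following is a minimal sketch, not the authors' implementation. It slides a window of `group_len` time steps over a spike raster with a given `stride` and collapses each window into one coarser time step. The function name `temporal_groups`, the summation used to collapse a window, and the zero-padding of an incomplete final window are all assumptions made for illustration; the abstract does not specify these details.

```python
from typing import List


def temporal_groups(spikes: List[List[int]], group_len: int, stride: int) -> List[List[int]]:
    """Regroup a (T x N) spike raster along time (hypothetical helper).

    Each output step is the channel-wise sum of `group_len` consecutive
    input steps; windows start every `stride` steps. Summation and
    zero-padding are illustrative assumptions, not the paper's definition.
    """
    if group_len <= 0 or stride <= 0:
        raise ValueError("group_len and stride must be positive")
    total = len(spikes)
    if total == 0:
        return []
    channels = len(spikes[0])

    # Number of windows needed so that every input step is covered.
    if total <= group_len:
        num_windows = 1
    else:
        num_windows = 1 + -(-(total - group_len) // stride)  # ceiling division

    # Zero-pad the tail so the last window is complete (assumption).
    padded_len = (num_windows - 1) * stride + group_len
    padded = spikes + [[0] * channels for _ in range(padded_len - total)]

    grouped = []
    for w in range(num_windows):
        window = padded[w * stride : w * stride + group_len]
        grouped.append([sum(column) for column in zip(*window)])
    return grouped


# One channel, four time steps.
raster = [[1], [1], [0], [1]]
overlapping = temporal_groups(raster, group_len=2, stride=1)   # stride < group_len
padded_case = temporal_groups(raster, group_len=3, stride=2)   # last window is padded
disjoint = temporal_groups(raster, group_len=2, stride=2)      # stride == group_len
```

Under these assumptions, a window containing two spikes yields the value 2, which is one possible reading of a "strong spike"; setting `stride` equal to `group_len` gives non-overlapping groups, while a smaller stride gives overlapping ones.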

Paper Structure (17 sections, 7 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: LIF neurons
  • Figure 2: The structure of the SNN-Delays model that incorporates Temporal Reconstruction (TR) and Non-Aligned Residual (NAR) methods. TR allows the model to learn information carried by multiple timescales in the data, and NAR enables residual connections to be applied to the data of varying time lengths.
  • Figure 3: TR-o (TR with overlap). The blue blocks represent no spikes, the green blocks represent standard spikes, and the pink blocks represent strong spikes. (a) A schematic diagram for the case where the total time length is evenly divisible by the group-within time length, with a group-within time length of 2 and a stride of 1. (b) A schematic diagram for the case where the total time length is not evenly divisible by the group-within time length, with a group-within time length of 3 and a stride of 2.
  • Figure 4: TR-no (TR without overlap). The block colors have the same meanings as in Figure 3. (a) A schematic diagram for the case where the total time length is evenly divisible by the group-within time length. (b) A schematic diagram for the case where the total time length is not evenly divisible by the group-within time length.
  • Figure 5: NAR. This method can allow residual connections to be applied to the data of varying time lengths.
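
The Figure 5 caption states only that NAR lets residual connections operate on data of varying time lengths. As a hedged illustration of what such a connection could look like, the sketch below adds a skip branch to a main branch over the time steps they share, aligned at the start, and leaves the remaining steps of the main branch unchanged. The function name `non_aligned_residual`, the start alignment, and the choice to keep the main branch's length are assumptions for illustration; the actual NAR rule is not given in this summary.

```python
from typing import List


def non_aligned_residual(main: List[List[int]], skip: List[List[int]]) -> List[List[int]]:
    """Add `skip` to `main` where they overlap in time (hypothetical helper).

    Both inputs are (T x N) sequences with the same channel count N but
    possibly different lengths T. The output keeps the length of `main`.
    Aligning at the first time step is an illustrative assumption.
    """
    if main and skip and len(main[0]) != len(skip[0]):
        raise ValueError("main and skip must have the same number of channels")

    overlap = min(len(main), len(skip))
    result = [row[:] for row in main]  # copy so the input is not mutated
    for t in range(overlap):
        result[t] = [a + b for a, b in zip(main[t], skip[t])]
    return result


main_branch = [[1, 0], [0, 1], [1, 1]]          # 3 time steps
long_skip = [[1, 1]] * 5                        # 5 time steps
short_skip = [[1, 1]] * 2                       # 2 time steps

with_long_skip = non_aligned_residual(main_branch, long_skip)
with_short_skip = non_aligned_residual(main_branch, short_skip)
```

With a longer skip branch, the extra skip steps are ignored; with a shorter one, the trailing steps of the main branch pass through unchanged. Other alignment choices (for example, aligning at the end or interpolating) would be equally plausible under the caption's wording.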