Spiking Two-Stream Methods with Unsupervised STDP-based Learning for Action Recognition
Mireille El-Assal, Pierre Tirilly, Ioan Marius Bilasco
TL;DR
This work addresses the energy and data-label demands of traditional action-recognition models by transferring a CNN-inspired two-stream architecture to the spiking domain using Convolutional Spiking Neural Networks trained with unsupervised Spike Timing-Dependent Plasticity. The spatial stream captures appearance while multiple temporal streams model motion, and their features are fused for action classification. Across four datasets, the authors show that the spatial and temporal streams are complementary, with fusion improving performance in most cases; however, spatio-temporal (3D) streams can introduce feature redundancy and are sensitive to spatial noise. The findings highlight the practicality of STDP-based two-stream CSNNs for low-shot video understanding and point to future work on fully spiking pre-processing and classifiers to maximize neuromorphic efficiency.
Abstract
Video analysis is a computer vision task with many applications, such as surveillance, human-machine interaction, and autonomous vehicles. Deep Convolutional Neural Networks (CNNs) are currently the state-of-the-art methods for video analysis. However, they have high computational costs and need a large amount of labeled data for training. In this paper, we use Convolutional Spiking Neural Networks (CSNNs) trained with the unsupervised Spike Timing-Dependent Plasticity (STDP) learning rule for action classification. These networks represent information using asynchronous low-energy spikes, which makes them more energy efficient and better suited to neuromorphic hardware. However, the behaviour of CSNNs in spatio-temporal computer vision models has not been studied enough. Therefore, we explore transposing two-stream neural networks into the spiking domain. Implementing this model with unsupervised STDP-based CSNNs allows us to further study the performance of these networks on video analysis. In this work, we show that two-stream CSNNs can successfully extract spatio-temporal information from videos despite using limited training data, and that the spiking spatial and temporal streams are complementary. We also show that using a spatio-temporal stream within a spiking STDP-based two-stream architecture leads to information redundancy and does not improve the performance.
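
For readers unfamiliar with STDP, the following is a minimal sketch of the classical pair-based form of the rule, in which a synapse is strengthened when the pre-synaptic spike precedes the post-synaptic spike and weakened otherwise. It is an illustration only: the abstract does not specify which STDP variant the paper uses, and the function names, the amplitudes `a_plus` and `a_minus`, the time constants, and the weight bounds are all assumptions chosen for the example.

```python
import math


def stdp_delta_w(t_pre, t_post, a_plus=0.01, a_minus=0.012,
                 tau_plus=20.0, tau_minus=20.0):
    """Weight change for a single pre/post spike pair (times in ms).

    Classical pair-based STDP; all parameter values are illustrative
    assumptions, not taken from the paper.
    """
    dt = t_post - t_pre
    if dt >= 0:
        # Pre-synaptic spike arrived before (or with) the post-synaptic
        # spike: potentiate, more strongly when the spikes are close.
        return a_plus * math.exp(-dt / tau_plus)
    # Post-synaptic spike came first: depress the synapse.
    return -a_minus * math.exp(dt / tau_minus)


def apply_stdp(w, t_pre, t_post, w_min=0.0, w_max=1.0, **kwargs):
    """Apply one STDP update and clip the weight to [w_min, w_max]."""
    return min(w_max, max(w_min, w + stdp_delta_w(t_pre, t_post, **kwargs)))


if __name__ == "__main__":
    # Pre before post: the weight increases.
    print(apply_stdp(0.5, t_pre=10.0, t_post=15.0))
    # Post before pre: the weight decreases.
    print(apply_stdp(0.5, t_pre=15.0, t_post=10.0))
```

Because the update depends only on spike times local to the synapse, no labels are needed, which is what makes this kind of rule unsupervised.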
