Spiking Neural Networks for event-based action recognition: A new task to understand their advantage
Alex Vicente-Sola, Davide L. Manna, Paul Kirkland, Gaetano Di Caterina, Trevor Bihl
TL;DR
This work investigates how Spiking Neural Networks (SNNs) can perform temporal feature extraction without recurrent connections by introducing DVS-Gesture-Chain (DVS-GC), a task that requires understanding the order of events in real neuromorphic data. The study compares feed-forward SNNs, recurrent SNNs, and time-dependent weighting and normalization schemes against non-spiking baselines. It shows that the widely used DVS Gesture benchmark can be solved without temporal feature extraction when its events are accumulated into frames, whereas DVS-GC requires perceiving the order in which events occur. On DVS-GC, SNNs capture temporal dependencies that go beyond frame accumulation, with the leakage rate and the reset mechanism (in particular "hard reset") playing important roles, while time-dependent weights and normalization provide an alternative route through temporal attention. The findings clarify the complementary roles of internal neuron state, temporal attention, and normalization in temporal processing, and offer guidance for designing energy-efficient, temporally capable neuromorphic systems.
Abstract
Spiking Neural Networks (SNNs) are characterised by their unique temporal dynamics, but the properties and advantages of such computations are still not well understood. To provide answers, in this work we demonstrate how spiking neurons can enable temporal feature extraction in feed-forward neural networks without the need for recurrent synapses, and how recurrent SNNs can achieve results comparable to LSTMs with a smaller number of parameters. This shows how their bio-inspired computing principles can be exploited beyond energy efficiency gains and highlights their differences with respect to conventional artificial neural networks. These results are obtained through a new task, DVS-Gesture-Chain (DVS-GC), which makes it possible, for the first time, to evaluate the perception of temporal dependencies in a real event-based action recognition dataset. Our study shows that the widely used DVS Gesture benchmark can be solved by networks without temporal feature extraction when its events are accumulated into frames, unlike the new DVS-GC, which demands an understanding of the order in which events happen. Furthermore, this setup allows us to reveal the role of the leakage rate in spiking neurons for temporal processing tasks and to demonstrate the benefits of "hard reset" mechanisms. Finally, we show how time-dependent weights and normalization can lead to understanding order by means of temporal attention.
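
As an illustration of the idea above, the following minimal sketch contrasts frame accumulation with a single feed-forward leaky integrate-and-fire neuron. It is not the architecture or neuron model used in the paper: the discrete-time update, the function names (`accumulate_frame`, `lif_run`), the two-channel toy input, and all weights and thresholds are assumptions chosen only to make the effect visible.

```python
import numpy as np


def accumulate_frame(events):
    """Sum a (T, N) binary event array over time; event order is discarded."""
    return events.sum(axis=0)


def lif_run(inputs, leak=0.5, threshold=1.0, hard_reset=True):
    """Simulate one discrete-time leaky integrate-and-fire neuron.

    inputs: 1-D array with the input current at each time step.
    leak: fraction of membrane potential kept per step (1.0 = no leakage).
    hard_reset: if True, the potential is set to 0 after a spike;
                otherwise the threshold is subtracted (soft reset).
    Returns the binary output spike train as an integer array.
    """
    v = 0.0
    spikes = []
    for x in inputs:
        v = leak * v + x  # decay the stored potential, then integrate
        if v >= threshold:
            spikes.append(1)
            v = 0.0 if hard_reset else v - threshold
        else:
            spikes.append(0)
    return np.array(spikes)


# Toy input (hypothetical): channel 0 fires for "gesture A",
# channel 1 fires for "gesture B". Rows are time steps.
events_ab = np.array([[1, 0], [0, 1]])  # A then B
events_ba = np.array([[0, 1], [1, 0]])  # B then A
weights = np.array([0.9, 0.5])          # illustrative synaptic weights

# Accumulated frames are identical, so a frame-based model cannot tell
# the two orders apart.
print(accumulate_frame(events_ab), accumulate_frame(events_ba))

# With leakage, the same feed-forward neuron responds differently.
print(lif_run(events_ab @ weights, leak=0.5))  # [0 0]
print(lif_run(events_ba @ weights, leak=0.5))  # [0 1]

# Without leakage (perfect integrator), both orders give the same output.
print(lif_run(events_ab @ weights, leak=1.0))  # [0 1]
print(lif_run(events_ba @ weights, leak=1.0))  # [0 1]
```

In this toy setting, the leaky neuron fires only when the strongly weighted input arrives last, because the earlier input has partly decayed by then. A non-leaky neuron integrates both inputs equally and cannot separate the two orders, which is one simple way to see why the leakage rate matters for order-dependent tasks.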
