Efficient Gesture Recognition on Spiking Convolutional Networks Through Sensor Fusion of Event-Based and Depth Data

Lea Steffen, Thomas Trapp, Arne Roennau, Rüdiger Dillmann

TL;DR

Gesture recognition for accessible human–robot interaction is addressed by a spiking convolutional neural network that fuses event-based data with temporally encoded depth. The approach uses surrogate-gradient training in the Lava framework and evaluates on a newly recorded synchronized bimodal dataset, with deployment on an embedded platform to assess practicality. A key contribution is the demonstration of temporal depth encoding and multimodal fusion in SNNs, along with a bimodal dataset that enables training across modalities. While event data provide strong offline performance, depth and fusion show more modest gains, and real-time deployment remains a target for future optimization, highlighting the potential and current limits of neuromorphic gesture recognition on resource-constrained devices.

Abstract

As intelligent systems become increasingly important in our daily lives, new ways of interaction are needed. Classical user interfaces pose problems for people with physical impairments and are in some cases neither practical nor convenient. Gesture recognition is an alternative, but it is often not reactive enough when conventional cameras are used. This work proposes a Spiking Convolutional Neural Network that processes event and depth data for gesture recognition. The network is simulated using the open-source neuromorphic computing framework Lava for offline training and for evaluation on an embedded system. Three open-source datasets are used for the evaluation. Since these do not represent the applied bimodality, a new dataset with synchronized event and depth data was recorded. The results show that temporal encoding of depth information is viable and that modality fusion, even on differently encoded data, benefits network performance and generalization capabilities.

Paper Structure (13 sections, 4 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: The network architecture includes feature extraction and sensor fusion. Event streams and temporally encoded depth data are used as input for the feature extraction. The high-level features are then used together for classification.
  • Figure 2: Preprocessing is only required for the depth data, as the event stream of the ATIS can be used directly as input for the SNN. The depth data are encoded with TTFS so that they are available as spike trains as well.
  • Figure 3: Examples of gestures from the newly recorded dataset, showing the encoded depth data from the RealSense and the corresponding events from the ATIS. The featured gestures are throwing an object and crossing the arms in front of the chest, each shown in both modalities.
  • Figure 4: Sensor setup of the ATIS and Intel RealSense, enabling the recording of a synchronized bimodal dataset for gesture recognition.
  • Figure 5: Mean accuracies during training across different runs. One panel shows the accuracy of networks using the different modalities on a core set of gestures; the other shows the same networks trained on a larger set of data.
  • ...and 1 more figures
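
The Figure 2 caption states that depth data are encoded with TTFS (time-to-first-spike) so that they are available as spike trains. A minimal sketch of what such an encoding could look like is given below. It is not the authors' implementation: the function name `ttfs_encode`, the number of time steps, the depth range, and the choice that closer pixels spike earlier are all assumptions made for illustration.

```python
import numpy as np


def ttfs_encode(depth, num_steps=16, d_min=0.3, d_max=3.0):
    """Encode a depth frame as a binary spike train of shape (num_steps, H, W).

    Assumptions (not taken from the paper): depth is given in metres,
    closer pixels spike earlier, and pixels outside [d_min, d_max]
    (for example 0 = no measurement) never spike.
    """
    depth = np.asarray(depth, dtype=np.float64)
    valid = (depth >= d_min) & (depth <= d_max)

    # Normalise depth to [0, 1]: 0 = nearest, 1 = farthest.
    norm = np.clip((depth - d_min) / (d_max - d_min), 0.0, 1.0)

    # Map each pixel to one discrete spike time in {0, ..., num_steps - 1}.
    t_spike = np.rint(norm * (num_steps - 1)).astype(np.int64)

    # Each valid pixel emits exactly one spike at its spike time.
    spikes = np.zeros((num_steps,) + depth.shape, dtype=np.uint8)
    rows, cols = np.nonzero(valid)
    spikes[t_spike[rows, cols], rows, cols] = 1
    return spikes


if __name__ == "__main__":
    frame = np.array([[0.3, 3.0],
                      [0.0, 1.2]])  # 0.0 marks a missing measurement
    train = ttfs_encode(frame, num_steps=4)
    print(train.shape)        # (time steps, height, width)
    print(train.sum(axis=0))  # spikes per pixel: one if valid, zero otherwise
```

Because every valid pixel fires at most once, the resulting tensor is very sparse and has the same (time, height, width) layout as a binned event stream, which is one plausible reason a temporal code suits fusion with event data in an SNN.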