Digit Recognition using Multimodal Spiking Neural Networks

William Bjorndahl, Jack Easton, Austin Modoff, Eric C. Larson, Joseph Camp, Prasanna Rangarajan

TL;DR

This work investigates digit recognition with multimodal spiking neural networks by fusing event-based visual input from N-MNIST and auditory input from SHD. It implements unimodal and three multimodal SNN architectures using Leaky Integrate-and-Fire neurons with discrete-time processing, a cumulative-sum readout, and surrogate gradient learning, while evaluating fusion depth via early, middle, and late concatenation. Results show multimodal SNNs outperform unimodal baselines, achieving up to 98.43% accuracy on the combined dataset, and McNemar tests confirm significant gains in some configurations while indicating fusion depth is not critical to performance. The findings highlight robust neuromorphic multisensory integration and point to flexible fusion strategies for complex sensory processing tasks.
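To make the neuron model and readout named above concrete, here is a minimal sketch of a discrete-time Leaky Integrate-and-Fire layer followed by a cumulative-sum readout. It is an illustration under stated assumptions, not the authors' implementation: the decay factor `beta`, the hard reset to zero, the choice to accumulate output spikes (rather than membrane potentials), and all function names are hypothetical. Surrogate gradient learning only affects training and is not shown.

```python
from typing import List, Sequence


def lif_layer(
    inputs: Sequence[Sequence[int]],
    weights: Sequence[Sequence[float]],
    beta: float = 0.9,
    threshold: float = 1.0,
) -> List[List[int]]:
    """Run one layer of LIF neurons over a binary spike train.

    inputs:  T time steps, each a list of n_in spikes (0 or 1).
    weights: n_out rows of n_in synaptic weights.
    Returns T time steps, each a list of n_out output spikes.
    """
    n_out = len(weights)
    membrane = [0.0] * n_out  # membrane potential per neuron
    output: List[List[int]] = []
    for x_t in inputs:
        spikes_t = []
        for j in range(n_out):
            # Weighted input current at this time step.
            current = sum(w * x for w, x in zip(weights[j], x_t))
            # Leak the previous potential, then integrate the new current.
            membrane[j] = beta * membrane[j] + current
            if membrane[j] >= threshold:
                spikes_t.append(1)
                membrane[j] = 0.0  # hard reset (an assumption of this sketch)
            else:
                spikes_t.append(0)
        output.append(spikes_t)
    return output


def cumulative_sum_readout(spikes: Sequence[Sequence[int]]) -> List[int]:
    """Accumulate output spikes over all time steps, one total per neuron."""
    counts = [0] * len(spikes[0])
    for s_t in spikes:
        for j, s in enumerate(s_t):
            counts[j] += s
    return counts


def predict(spikes: Sequence[Sequence[int]]) -> int:
    """Pick the class whose output neuron accumulated the most spikes."""
    counts = cumulative_sum_readout(spikes)
    return max(range(len(counts)), key=counts.__getitem__)
```

In this sketch, a neuron that receives a sub-threshold input repeatedly fires only once the leaked potential has built up enough, which is the temporal integration that lets an SNN consume event streams one time step at a time.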

Abstract

Spiking neural networks (SNNs) are the third generation of neural networks, biologically inspired to process data in a way that emulates the exchange of signals in the brain. Within the computer vision community, SNNs have garnered significant attention, due in large part to the availability of event-based sensors that produce a spatially resolved spike train in response to changes in scene radiance. Their neuromorphic nature makes SNNs a natural fit for processing event-based data. This work examines the neuromorphic advantage of fusing multiple sensory inputs in classification tasks. Specifically, we study the performance of an SNN in digit classification with a visual modality branch (Neuromorphic-MNIST [N-MNIST]) and an auditory modality branch (Spiking Heidelberg Digits [SHD]), using datasets that were created with event-based sensors to generate a series of time-dependent events. We observe that multimodal SNNs outperform unimodal visual and unimodal auditory SNNs. Furthermore, sensory fusion is insensitive to the depth at which the visual and auditory branches are combined. This work achieves 98.43% accuracy on the combined N-MNIST and SHD dataset using a multimodal SNN that concatenates the visual and auditory branches at a late depth.
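The idea of combining the two branches by concatenation at a late depth can be sketched as follows. This is a simplified illustration, not the paper's architecture: it assumes both branches have already been reduced to spike trains with the same number of time steps, stands in a single linear layer for whatever follows the fusion point, and uses hypothetical function names and weights throughout.

```python
from typing import List, Sequence


def concat_fusion(
    visual: Sequence[List[int]], auditory: Sequence[List[int]]
) -> List[List[int]]:
    """Concatenate two spike trains along the feature axis, step by step."""
    if len(visual) != len(auditory):
        # Assumption: both modalities were binned to the same number of steps.
        raise ValueError("both modalities must share the same number of time steps")
    return [v + a for v, a in zip(visual, auditory)]


def linear_scores(
    features: Sequence[float], weights: Sequence[Sequence[float]]
) -> List[float]:
    """Compute one score per class as a weighted sum of the fused features."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def late_fusion_predict(
    visual_spikes: Sequence[List[int]],
    auditory_spikes: Sequence[List[int]],
    weights: Sequence[Sequence[float]],
) -> int:
    """Fuse the branch outputs, sum spikes over time, and classify."""
    fused = concat_fusion(visual_spikes, auditory_spikes)
    # Per-feature spike totals over all time steps.
    counts = [sum(column) for column in zip(*fused)]
    scores = linear_scores(counts, weights)
    return max(range(len(scores)), key=scores.__getitem__)
```

Moving the call to `concat_fusion` earlier (onto the raw inputs) or to an intermediate layer would give the early and middle variants in the same spirit; the only structural requirement in this sketch is that the two streams agree on the number of time steps at the point where they are joined.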

Paper Structure (15 sections, 6 equations, 2 figures, 2 tables)

This paper contains 15 sections, 6 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Unimodal network structures showing example data (digit two) as input for (a) visual and (b) auditory modalities.
  • Figure 2: Comparison of combining the visual and auditory branches at (a) early, (b) middle, and (c) late depths in our multimodal SNN architecture.