Spiking Neural Network Feature Discrimination Boosts Modality Fusion

Katerina Maria Oikonomou, Ioannis Kansizoglou, Antonios Gasteratos

TL;DR

This work addresses the challenge of learning discriminative representations in spiking neural networks (SNNs) for multi-modal audio-visual classification. It introduces a feature-discrimination framework with four components: L2 normalization, which places embeddings on a hypersphere; a novel visual architecture (L2-ActSpikeNet) based on spiking ResNet18; a compact audio SNN; and a Spiking MLP (SMLP) fusion module that combines the two modalities. The approach achieves performance competitive with the state of the art on CIFAR10-AV and UrbanSound8K-AV, with fusion accuracies of 98.60% and 97.20% respectively, and feature normalization yields improved intra-class compactness and inter-class separability. The findings highlight the potential of combining angular discriminability with neuromorphic processing for energy-efficient, high-performance multi-modal learning, and open avenues for real-time deployment on neuromorphic hardware.
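To make the hypersphere idea concrete, the following is a minimal sketch of L2 normalization, not the paper's implementation. The function names, the toy vectors, and the use of concatenation to build the fusion input are all assumptions made for illustration; the summary does not specify how the SMLP combines the two embeddings.

```python
import math

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm so it lies on the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vec))
    # eps guards against division by zero for an all-zero vector.
    return [x / max(norm, eps) for x in vec]

def dot(a, b):
    """Dot product; for unit vectors this equals the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical embeddings: same direction, very different magnitudes.
weak_activation = [1.0, 2.0]
strong_activation = [10.0, 20.0]

weak_unit = l2_normalize(weak_activation)
strong_unit = l2_normalize(strong_activation)

# After normalization only the angle matters, so both map to the same point.
same_direction_similarity = dot(weak_unit, strong_unit)

# Hypothetical visual and audio embeddings, normalized independently.
visual_unit = l2_normalize([3.0, 4.0])
audio_unit = l2_normalize([0.0, 2.0, 0.0])

# Assumption: the fusion input is a plain concatenation of the two embeddings.
fused_input = visual_unit + audio_unit
```

Because normalization discards magnitude, two embeddings of the same class that differ only in activation strength collapse to the same point on the sphere, which is one way to read the intra-class compactness reported above.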

Abstract

Feature discrimination is a crucial aspect of neural network design, as it directly impacts the network's ability to distinguish between classes and generalize across diverse datasets. Achieving high-quality feature representations ensures high inter-class separability and remains one of the most challenging research directions. While conventional deep neural networks (DNNs) rely on complex transformations and very deep networks to produce meaningful feature representations, they usually require days of training and consume significant amounts of energy. To this end, spiking neural networks (SNNs) offer a promising alternative. SNNs' ability to capture temporal and spatial dependencies renders them particularly suitable for complex tasks that require multi-modal data. In this paper, we propose a feature discrimination approach for multi-modal learning with SNNs, focusing on audio-visual data. We employ deep spiking residual learning for visual modality processing and a simpler yet efficient spiking network for auditory modality processing. Lastly, we deploy a spiking multilayer perceptron for modality fusion. We present our findings and evaluate our approach against similar works on classification tasks. To the best of our knowledge, this is the first work investigating feature discrimination in SNNs.

Paper Structure

This paper contains 22 sections, 25 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Overview of the proposed multi-modal spiking fusion architecture for audio-visual classification, integrating feature discrimination with spiking neural networks.
  • Figure 2: Architecture of the L2-ActSpikeNet visual feature extraction network, incorporating residual learning with spiking neurons and our proposed L2 normalization layer for enhanced feature discrimination.
  • Figure 3: Comparison of (A) the standard residual block in conventional ResNet and (B) the proposed ActAfterAddition block in our L2-ActSpikeNet approach.
  • Figure 4: The proposed spiking audio feature extraction network, with the L2 normalization layer for audio feature discrimination.
  • Figure 5: Architecture of the Spiking Multilayer Perceptron (SMLP) for multi-modal fusion, integrating normalized visual and audio feature embeddings for classification.
  • ...and 4 more figures