Enhancing Audio-Visual Spiking Neural Networks through Semantic-Alignment and Cross-Modal Residual Learning

Xiang He; Dongcheng Zhao; Yiting Dong; Guobin Shen; Xin Yang; Yi Zeng

TL;DR

The paper addresses robust cross-modal fusion in spiking neural networks (SNNs) for audio-visual tasks. It introduces semantic-alignment cross-modal residual learning (S-CMRL), a Transformer-based multimodal SNN built on a cross-modal complementary spatiotemporal spiking attention (CCSSA) mechanism and a semantic alignment optimization (SAO) module. CCSSA treats cross-modal information as a residual added to each modality's own features, which preserves unimodal semantics. SAO aligns these cross-modal residuals in a shared semantic space through a loss $\mathcal{L}_{sao}$, so the total training objective is $\mathcal{L}=\mathcal{L}_{ce}+\mathcal{L}_{sao}$, where $\mathcal{L}_{ce}$ is the cross-entropy classification loss. Experiments on CREMA-D, UrbanSound8K-AV, and MNISTDVS-NTIDIGITS show state-of-the-art accuracy and strong noise robustness, demonstrating the practical potential of semantically aligned multimodal SNNs.
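To make the objective $\mathcal{L}=\mathcal{L}_{ce}+\mathcal{L}_{sao}$ concrete, here is a minimal NumPy sketch. The summary does not give the form of $\mathcal{L}_{sao}$, so `alignment_loss` below is an assumed stand-in (a symmetric contrastive loss over paired audio and visual residuals), not the paper's definition. The function names and the `temperature` parameter are hypothetical.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy over a batch: logits (B, C), integer labels (B,)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def alignment_loss(res_a, res_v, temperature=0.1):
    """ASSUMED stand-in for L_sao (not the paper's definition): a symmetric
    contrastive loss pulling together the audio and visual residual features
    of the same sample. res_a, res_v: (B, D) with non-zero rows."""
    a = res_a / np.linalg.norm(res_a, axis=1, keepdims=True)
    v = res_v / np.linalg.norm(res_v, axis=1, keepdims=True)
    sim = a @ v.T / temperature      # (B, B) scaled cosine similarities
    targets = np.arange(len(a))      # matching pairs lie on the diagonal
    return 0.5 * (softmax_cross_entropy(sim, targets)
                  + softmax_cross_entropy(sim.T, targets))

def total_loss(logits, labels, res_a, res_v):
    """L = L_ce + L_sao, the unweighted sum stated in the summary."""
    return softmax_cross_entropy(logits, labels) + alignment_loss(res_a, res_v)
```

Under this assumed form, the alignment term is small when each audio residual is most similar to the visual residual of the same sample and grows when the pairing is mismatched.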

Abstract

Humans interpret and perceive the world by integrating sensory information from multiple modalities, such as vision and hearing. Spiking Neural Networks (SNNs), as brain-inspired computational models, exhibit unique advantages in emulating the brain's information processing mechanisms. However, existing SNN models primarily focus on unimodal processing and lack efficient cross-modal information fusion, which limits their effectiveness in real-world multimodal scenarios. To address this challenge, we propose a semantic-alignment cross-modal residual learning (S-CMRL) framework, a Transformer-based multimodal SNN architecture designed for effective audio-visual integration. S-CMRL leverages a spatiotemporal spiking attention mechanism to extract complementary features across modalities, and incorporates a cross-modal residual learning strategy to enhance feature integration. Additionally, a semantic alignment optimization mechanism is introduced to align cross-modal features within a shared semantic space, improving their consistency and complementarity. Extensive experiments on three benchmark datasets (CREMA-D, UrbanSound8K-AV, and MNISTDVS-NTIDIGITS) demonstrate that S-CMRL significantly outperforms existing multimodal SNN methods, achieving state-of-the-art performance. The code is publicly available at https://github.com/Brain-Cog-Lab/S-CMRL.
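The cross-modal residual idea can be illustrated with a heavily simplified sketch: one modality's tokens query the other modality, and the result is added back as a scaled residual so the original unimodal features pass through unchanged. This is not the paper's attention module. The Heaviside `spike` function, the softmax-free attention, the omission of the temporal dimension, and the names `cross_modal_residual`, `alpha`, and `scale` are all assumptions made for illustration.

```python
import numpy as np

def spike(x, threshold=0.5):
    """Heaviside step: emit a binary spike where the input exceeds the threshold."""
    return (x > threshold).astype(np.float64)

def cross_modal_residual(x_main, x_other, w_q, w_k, w_v, alpha=0.5, scale=0.125):
    """Return x_main plus a scaled cross-modal complement (illustrative only).

    x_main:  (N, D) tokens of the modality being enhanced.
    x_other: (M, D) tokens of the complementary modality.
    w_q, w_k, w_v: (D, D) projection matrices (hypothetical parameters).
    """
    q = spike(x_main @ w_q)        # binary queries from the main modality
    k = spike(x_other @ w_k)       # binary keys from the other modality
    v = spike(x_other @ w_v)       # binary values from the other modality
    attn = (q @ k.T) * scale       # (N, M) spike-count similarity, no softmax
    complement = attn @ v          # (N, D) information gathered across modalities
    return x_main + alpha * complement  # residual keeps x_main intact
```

Because the cross-modal term enters only as an additive residual, setting `alpha` to zero, or feeding a silent complementary modality, returns the unimodal features unchanged.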
