Spiking Neural Networks with Temporal Attention-Guided Adaptive Fusion for Imbalanced Multi-modal Learning
Jiangrong Shen, Yulin Xie, Qi Xu, Gang Pan, Huajin Tang, Badong Chen
TL;DR
This work tackles modality imbalance and temporal misalignment in multimodal spiking neural networks by introducing a Temporal Attention-guided Adaptive Fusion (TAAF) module and a temporal adaptive balanced fusion loss. The per-timestep attention scores guide both feature fusion and loss weighting, while gradient modulation balances learning rates across modalities to prevent dominance. Empirical results on CREMA-D, AVE, and EAD show state-of-the-art accuracy with substantially fewer time steps and improved energy efficiency, validating the approach's ability to leverage temporal dynamics for robust multisensory integration. The method advances neuromorphic multimodal learning by aligning temporal representations with biological principles and enabling practical deployment on energy-constrained hardware.
Abstract
Multimodal spiking neural networks (SNNs) hold significant potential for energy-efficient sensory processing but face critical challenges in modality imbalance and temporal misalignment. Current approaches suffer from uncoordinated convergence speeds across modalities and from static fusion mechanisms that ignore time-varying cross-modal interactions. We propose a temporal attention-guided adaptive fusion framework for multimodal SNNs with two synergistic innovations: (1) the Temporal Attention-guided Adaptive Fusion (TAAF) module, which dynamically assigns importance scores to fused spiking features at each timestep, enabling hierarchical integration of temporally heterogeneous spike-based features; and (2) the temporal adaptive balanced fusion loss, which modulates the learning rate of each modality based on these attention scores, preventing dominant modalities from monopolizing optimization. The proposed framework performs adaptive fusion, especially along the temporal dimension, and alleviates modality imbalance during multimodal learning, mimicking cortical multisensory integration principles. Evaluations on the CREMA-D, AVE, and EAD datasets demonstrate state-of-the-art accuracy (77.55%, 70.65%, and 97.5%, respectively) while remaining energy efficient. The system resolves temporal misalignment through learnable time-warping operations and coordinates modality convergence faster than baseline SNNs. This work establishes a new paradigm for temporally coherent multimodal learning in neuromorphic systems, bridging the gap between biological sensory processing and efficient machine intelligence.
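To make the idea of per-timestep attention concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that each modality yields one feature vector per timestep (for example, spike counts), that the fused feature is a simple concatenation, and that a single hypothetical weight vector `score_weights` produces the raw importance score of each timestep. The function names are illustrative, and the gradient modulation step described in the abstract is not modeled here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention_fusion(audio, visual, score_weights):
    """Fuse two modalities and weight each timestep by an attention score.

    audio, visual: lists of T feature vectors (lists of floats).
    score_weights: hypothetical scoring vector whose length equals the
        concatenated feature size (an assumption for this sketch).
    Returns (attn, weighted): T attention scores and T re-weighted features.
    """
    # Concatenate the two modalities at every timestep.
    fused = [a + v for a, v in zip(audio, visual)]
    # One raw importance score per timestep (a plain dot product here).
    raw = [sum(w * x for w, x in zip(score_weights, f)) for f in fused]
    # Normalize across the temporal dimension.
    attn = softmax(raw)
    weighted = [[a_t * x for x in f] for a_t, f in zip(attn, fused)]
    return attn, weighted


def attention_weighted_loss(per_step_losses, attn):
    """Combine per-timestep losses using the same attention scores."""
    return sum(a * l for a, l in zip(attn, per_step_losses))


# Toy inputs: T = 3 timesteps, 2 features per modality.
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
visual = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
attn, weighted = temporal_attention_fusion(audio, visual, [0.5, 0.5, 0.5, 0.5])
loss = attention_weighted_loss([0.9, 0.6, 0.3], attn)
```

In this toy setup, timesteps with more spiking activity receive larger scores, so they contribute more to both the fused representation and the loss. With an all-zero `score_weights`, the attention is uniform and the loss reduces to the plain average over timesteps.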
