Spiking Tucker Fusion Transformer for Audio-Visual Zero-Shot Learning
Wenrui Li, Penghong Wang, Ruiqin Xiong, Xiaopeng Fan
TL;DR
The paper proposes STFT, a Spiking Tucker Fusion Transformer, to address the challenge of coupling binary SNN temporal encoding with floating-point Transformer semantics in audio-visual zero-shot learning. It introduces a Time-Step Factor (TSF), Global-Local Pooling (GLP), and dynamic neuron thresholds, plus a temporal-semantic Tucker fusion module that enables multi-scale, full second-order interactions between modalities. Through latent knowledge slots and a cross-modal transformer, STFT achieves state-of-the-art harmonic-mean performance on VGGSound, UCF101, and ActivityNet, while reducing parameter count and computational cost relative to strong baselines. The work demonstrates substantial gains in cross-modal fusion, temporal reasoning, and robustness, with a scalable architecture suitable for larger AV datasets and future dynamic rank adaptations.
Abstract
Spiking neural networks (SNNs), which efficiently encode temporal sequences, have shown great potential in extracting audio-visual joint feature representations. However, coupling SNNs (binary spike sequences) with Transformers (floating-point sequences) to jointly explore temporal-semantic information still faces challenges. In this paper, we introduce a novel Spiking Tucker Fusion Transformer (STFT) for audio-visual zero-shot learning (ZSL). The STFT leverages the temporal and semantic information from different time steps to generate robust representations. A time-step factor (TSF) is introduced to dynamically synthesize the subsequent inference information. To guide the formation of input membrane potentials and reduce spike noise, we propose global-local pooling (GLP), which combines the max and average pooling operations. Furthermore, the thresholds of the spiking neurons are dynamically adjusted based on semantic and temporal cues. Integrating the temporal and semantic information extracted by SNNs and Transformers is difficult because a straightforward bilinear model greatly increases the number of parameters. To address this, we introduce a temporal-semantic Tucker fusion module, which achieves multi-scale fusion of SNN and Transformer outputs while maintaining full second-order interactions. Our experimental results demonstrate the effectiveness of the proposed approach, which achieves state-of-the-art performance on three benchmark datasets. The harmonic mean (HM) improvements on VGGSound, UCF101, and ActivityNet are around 15.4%, 3.9%, and 14.9%, respectively.
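To illustrate why a Tucker factorization helps with the parameter growth of a bilinear model, the following is a minimal NumPy sketch of a generic Tucker-factorized bilinear fusion of two feature vectors. It is not the authors' temporal-semantic module: the function name `tucker_fusion`, the feature dimensions, the ranks, and the random weights are all assumptions chosen for illustration, and the multi-scale aspect described in the abstract is omitted.

```python
import numpy as np


def tucker_fusion(s, t, W_s, W_t, core, W_o):
    """Fuse two feature vectors with a Tucker-factorized bilinear map.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    s_proj = s @ W_s  # project SNN-side feature: (d_s,) -> (r_s,)
    t_proj = t @ W_t  # project Transformer-side feature: (d_t,) -> (r_t,)
    # The core tensor weights every pairwise product s_proj[i] * t_proj[j],
    # so all second-order interactions between the projections are kept.
    z = np.einsum("i,j,ijk->k", s_proj, t_proj, core)  # (r_o,)
    return z @ W_o  # map to the output space: (r_o,) -> (d_o,)


# Assumed sizes (illustrative, not from the paper).
d_s, d_t, d_o = 64, 64, 32  # input and output feature dimensions
r_s, r_t, r_o = 8, 8, 8     # Tucker ranks

rng = np.random.default_rng(0)
W_s = rng.standard_normal((d_s, r_s))
W_t = rng.standard_normal((d_t, r_t))
core = rng.standard_normal((r_s, r_t, r_o))
W_o = rng.standard_normal((r_o, d_o))

s = rng.integers(0, 2, size=d_s).astype(float)  # binary, spike-like feature
t = rng.standard_normal(d_t)                    # floating-point feature

fused = tucker_fusion(s, t, W_s, W_t, core, W_o)

# Parameter counts: unfactorized bilinear tensor vs. Tucker factors.
full_params = d_s * d_t * d_o
tucker_params = W_s.size + W_t.size + core.size + W_o.size
print(fused.shape, full_params, tucker_params)
```

With these assumed sizes, an unfactorized bilinear tensor would need 64 × 64 × 32 = 131,072 weights, while the four Tucker factors together hold 1,792. The factorized map still equals a bilinear map whose weight tensor is the product of the factors, which is the sense in which second-order interactions are retained.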
