
Spiking Tucker Fusion Transformer for Audio-Visual Zero-Shot Learning

Wenrui Li, Penghong Wang, Ruiqin Xiong, Xiaopeng Fan

TL;DR

The paper proposes STFT, a Spiking Tucker Fusion Transformer, to address the challenge of coupling binary SNN temporal encoding with floating-point Transformer semantics in audio-visual zero-shot learning. It introduces a Time-Step Factor (TSF), Global-Local Pooling (GLP), and dynamic neuron thresholds, plus a temporal-semantic Tucker fusion module that enables multi-scale, full second-order interactions between modalities. Through latent knowledge slots and a cross-modal transformer, STFT achieves state-of-the-art harmonic-mean performance on VGGSound, UCF101, and ActivityNet, while reducing parameter count and computational cost relative to strong baselines. The work demonstrates substantial gains in cross-modal fusion, temporal reasoning, and robustness, with a scalable architecture suitable for larger AV datasets and future dynamic rank adaptations.

Abstract

Spiking neural networks (SNNs), which efficiently encode temporal sequences, have shown great potential in extracting joint audio-visual feature representations. However, coupling SNNs (binary spike sequences) with Transformers (floating-point sequences) to jointly explore temporal-semantic information still faces challenges. In this paper, we introduce a novel Spiking Tucker Fusion Transformer (STFT) for audio-visual zero-shot learning (ZSL). The STFT leverages the temporal and semantic information from different time steps to generate robust representations. The time-step factor (TSF) is introduced to dynamically synthesize the subsequent inference information. To guide the formation of input membrane potentials and reduce spike noise, we propose global-local pooling (GLP), which combines the max and average pooling operations. Furthermore, the thresholds of the spiking neurons are dynamically adjusted based on semantic and temporal cues. Integrating the temporal and semantic information extracted by SNNs and Transformers is difficult due to the increased number of parameters in a straightforward bilinear model. To address this, we introduce a temporal-semantic Tucker fusion module, which achieves multi-scale fusion of SNN and Transformer outputs while maintaining full second-order interactions. Our experimental results demonstrate the effectiveness of the proposed approach, which achieves state-of-the-art performance on three benchmark datasets. The harmonic mean (HM) improvements on VGGSound, UCF101, and ActivityNet are around 15.4%, 3.9%, and 14.9%, respectively.
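
The parameter argument behind Tucker fusion can be illustrated with a small NumPy sketch. It is not the authors' module: the dimensions, ranks, random weights, and the names `tucker_fusion`, `W_x`, `W_y`, `core`, and `W_o` are all assumptions chosen for illustration. It shows only the generic idea that a bilinear interaction tensor factorised in Tucker form still mixes every pair of input features while storing far fewer parameters than the full tensor.

```python
import numpy as np


def tucker_fusion(x, y, W_x, W_y, core, W_o):
    """Fuse two feature vectors with a Tucker-factorised bilinear map.

    Generic sketch, not the paper's implementation.
    x: (d_x,), y: (d_y,); the result has shape (d_o,).
    """
    x_proj = x @ W_x  # project first modality down to rank r_x
    y_proj = y @ W_y  # project second modality down to rank r_y
    # The core tensor carries the pairwise (second-order) interactions.
    z = np.einsum("i,j,ijk->k", x_proj, y_proj, core)
    return z @ W_o  # map from rank r_o up to the output size d_o


# Hypothetical sizes, chosen only for this example.
d_x, d_y, d_o = 64, 48, 32
r_x, r_y, r_o = 8, 6, 4

rng = np.random.default_rng(0)
W_x = rng.standard_normal((d_x, r_x))
W_y = rng.standard_normal((d_y, r_y))
core = rng.standard_normal((r_x, r_y, r_o))
W_o = rng.standard_normal((r_o, d_o))
x = rng.standard_normal(d_x)
y = rng.standard_normal(d_y)

fused = tucker_fusion(x, y, W_x, W_y, core, W_o)

# Rebuild the equivalent full bilinear tensor and apply it directly,
# to check that the factorised form computes the same bilinear map.
full_tensor = np.einsum("ai,bj,ijk,kc->abc", W_x, W_y, core, W_o)
reference = np.einsum("a,b,abc->c", x, y, full_tensor)

tucker_params = W_x.size + W_y.size + core.size + W_o.size
full_params = d_x * d_y * d_o

print("outputs match:", np.allclose(fused, reference))
print("Tucker parameters:", tucker_params)
print("full bilinear parameters:", full_params)
```

With these illustrative sizes the factorised form stores 1,120 parameters where the unconstrained bilinear tensor would need 98,304, and the two produce the same output for the same input. The ranks trade expressiveness against cost; how STFT chooses them is not stated in this summary.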

Paper Structure

This paper contains 34 sections, 18 equations, 5 figures, and 11 tables.

Figures (5)

  • Figure 1: Illustration of the proposed STFT for audio-visual GZSL. The SNN uses the time-step factor to dynamically synthesize the temporal output. The audio and visual encoders use the latent knowledge combiner to explore semantic information with latent cues. After temporal-semantic Tucker fusion, the fused features are further reasoned over by the cross-modal transformer. Information from seen training classes can transfer to unseen test classes through textual embeddings.
  • Figure 2: The overall architecture of STFT. The SNN thresholds are adjusted dynamically based on semantic and temporal cues. The spatial-temporal SNN uses GLP to refine the input features and the time-step factor to optimize the final output. The latent knowledge slots $\boldsymbol{K}_{t}$ explore and align the latent semantic relationships across modalities. The cross-modal transformers in the joint reasoning module share weights.
  • Figure 3: Ablation study of the impact of different time steps, rank constraints, and fixed thresholds on HM and ZSL performance on the UCF101 dataset.
  • Figure 4: Visualization examples on UCF101. We give t-SNE visualization results for five categories, which can be grouped into two parent classes: "Sports" and "Instrument".
  • Figure 5: Qualitative comparison with MDFT.
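
Figure 3 reports HM alongside ZSL performance. In generalized zero-shot learning, HM is conventionally the harmonic mean of the accuracies on seen and unseen classes. The sketch below assumes that convention; the function name and the accuracy values are hypothetical and are not results from the paper.

```python
def harmonic_mean(seen_acc: float, unseen_acc: float) -> float:
    """Harmonic mean of seen-class and unseen-class accuracy.

    Assumes the usual GZSL convention HM = 2 * S * U / (S + U).
    Returns 0.0 when both accuracies are zero to avoid dividing by zero.
    """
    total = seen_acc + unseen_acc
    if total == 0:
        return 0.0
    return 2 * seen_acc * unseen_acc / total


# Hypothetical accuracies, for illustration only.
balanced = harmonic_mean(0.40, 0.40)
lopsided = harmonic_mean(0.70, 0.10)

print(f"balanced model HM: {balanced:.3f}")
print(f"lopsided model HM: {lopsided:.3f}")
```

Both hypothetical models have the same arithmetic mean accuracy (0.40), but the harmonic mean penalises the one that does well on seen classes at the expense of unseen ones, which is why it is a common headline metric for GZSL.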