Multi-Timescale Motion-Decoupled Spiking Transformer for Audio-Visual Zero-Shot Learning

Wenrui Li, Penghong Wang, Xingtao Wang, Wangmeng Zuo, Xiaopeng Fan, Yonghong Tian

TL;DR

This work tackles background scene bias and insufficient motion detail in audio-visual zero-shot learning by introducing MDST++, a dual-stream architecture that decouples contextual semantics from sparse motion using an Event Generation Model and Spiking Transformer. A Recurrent Joint Learning Unit fuses audio-visual semantics, while a Discrepancy Analysis Block and dynamic thresholding enhance audio-motion reasoning and robust temporal processing with Spiking Neural Networks. The framework further advances with SpikeFormer and multi-stage timestep shrinkage to capture long-range, multi-scale temporal dependencies, fused through a Cross-Modal Reasoning Module and optimized by joint triplet, projection, and reconstruction losses. Across ActivityNet, UCF101, and VGGSound, MDST/MDST++ consistently surpass state-of-the-art methods in both ZSL and GZSL, with substantial gains in harmonic mean and unseen-class accuracy, highlighting the approach’s potential for energy-efficient neuromorphic deployment and robust video understanding in challenging settings.

Abstract

Audio-visual zero-shot learning (ZSL) has been extensively researched for its capability to classify video data from classes unseen during training. Nevertheless, current methodologies often struggle with background scene biases and inadequate motion detail. This paper proposes a novel dual-stream Multi-Timescale Motion-Decoupled Spiking Transformer (MDST++), which decouples contextual semantic information and sparse dynamic motion information. The recurrent joint learning unit is proposed to extract contextual semantic information and capture joint knowledge across various modalities to understand the environment of actions. By converting RGB images to events, our method captures motion information more accurately and mitigates background scene biases. Moreover, we introduce a discrepancy analysis block to model audio motion information. To enhance the robustness of spiking neural networks (SNNs) in extracting temporal and motion cues, we dynamically adjust the threshold of Leaky Integrate-and-Fire (LIF) neurons based on global motion and contextual semantic information. Our experiments validate the effectiveness of MDST++, demonstrating its consistent superiority over state-of-the-art methods on mainstream benchmarks. Additionally, incorporating motion and multi-timescale information significantly improves harmonic mean (HM) and ZSL accuracy by 26.2% and 39.9%, respectively.
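The RGB-to-event conversion mentioned in the abstract can be illustrated with a minimal sketch. It assumes a common frame-differencing scheme in which a pixel emits an event when its log intensity changes by more than a fixed contrast threshold between consecutive frames. The function name `frames_to_events` and the parameters `threshold` and `eps` are hypothetical and are not taken from the paper, whose Event Generation Model may differ.

```python
import math


def frames_to_events(frames, threshold=0.2, eps=1e-3):
    """Turn grayscale frames into per-pixel event polarity maps.

    Assumption (not from the paper): an event fires when the log-intensity
    change between consecutive frames reaches `threshold` in magnitude.
    `frames` is a list of 2D lists with values in [0, 1].
    Returns len(frames) - 1 maps with entries +1, -1, or 0.
    """
    events = []
    for prev, curr in zip(frames, frames[1:]):
        polarity_map = []
        for row_prev, row_curr in zip(prev, curr):
            row = []
            for p, c in zip(row_prev, row_curr):
                # eps avoids log(0) for fully dark pixels
                delta = math.log(c + eps) - math.log(p + eps)
                if delta >= threshold:
                    row.append(1)    # brightness increased
                elif delta <= -threshold:
                    row.append(-1)   # brightness decreased
                else:
                    row.append(0)    # static pixel, no event
            polarity_map.append(row)
        events.append(polarity_map)
    return events


if __name__ == "__main__":
    static_then_motion = [
        [[0.5, 0.5], [0.5, 0.5]],
        [[0.5, 0.9], [0.1, 0.5]],
    ]
    print(frames_to_events(static_then_motion))
```

Under this assumption, pixels that do not change between frames produce no events, which is one way to see why an event representation can suppress a static background while keeping motion.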

Paper Structure

This paper contains 39 sections, 19 equations, 10 figures, 9 tables, 2 algorithms.

Figures (10)

  • Figure 1: To mitigate background bias and highlight the differences among similar types of videos, we converted RGB images into events. This conversion only occurs when there are significant changes in the background scene.
  • Figure 2: The MDST architecture combines visual, audio, and textual features (represented by blue, green, and pink lines, respectively) using a two-stream design to extract scene contextual semantics and motion information separately. The "threshold adjustment" block dynamically modifies the SNN thresholds ($V_{th/vis}^{t}$ and $V_{th/adu}^{t}$) to regulate neuron firing rates and reduce potential noise efficiently.
  • Figure 3: An illustration of a LIF neuron. The membrane potential $V(t-1)$ and spike $S(t-1)$ at time $t-1$ are derived from $V(t-2)$ and $S(t-2)$, and processed to produce $U(t)$ and $S(t)$ at time $t$.
  • Figure 4: Main differences between MDST and MDST++. In MDST, a simple SNN is built using three linear layers with LIF neurons. In MDST++, self-attention and Transformers are integrated to improve the SNN's learning capability. Additionally, to leverage multi-scale temporal features, MDST++ divides the SNN into $m$ stages with progressively compressed time steps ($T_1 > T_2 > \dots > T_m$). The outputs at different time steps are used to calculate losses and train the model.
  • Figure 5: The illustration of dynamic threshold block, which adaptively modifies the threshold $V_{th}^{t}$ of LIF neurons according to the statistics of scene contextual semantics and motion features.
  • ...and 5 more figures
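
The LIF neuron of Figure 3 and the per-time-step threshold $V_{th}^{t}$ of Figures 2 and 5 can be sketched as follows. This is a minimal illustration that assumes a widely used discrete LIF update with hard reset; the paper's exact equations are not reproduced here. The helper `dynamic_thresholds`, with its `base` and `scale` parameters, is a hypothetical stand-in for the dynamic threshold block: it only shows a threshold that depends on feature statistics, not the authors' actual rule.

```python
def dynamic_thresholds(features, base=0.5, scale=0.5):
    """Hypothetical rule: threshold grows with the mean absolute feature value.

    `features` is a list of per-time-step feature vectors (lists of floats).
    Returns one threshold per time step.
    """
    thresholds = []
    for vec in features:
        mean_abs = sum(abs(x) for x in vec) / len(vec)
        thresholds.append(base + scale * mean_abs)
    return thresholds


def lif_forward(inputs, thresholds, tau=2.0, v_reset=0.0):
    """Run one LIF neuron over a sequence with a time-varying threshold.

    Assumed update (a common discrete LIF form, not necessarily the paper's):
      U(t) = V(t-1) + (X(t) - (V(t-1) - v_reset)) / tau
      S(t) = 1 if U(t) >= V_th(t) else 0
      V(t) = v_reset if S(t) == 1 else U(t)
    """
    v = v_reset
    spikes = []
    for x, v_th in zip(inputs, thresholds):
        u = v + (x - (v - v_reset)) / tau   # leaky integration
        s = 1 if u >= v_th else 0           # fire when the threshold is reached
        v = v_reset if s else u             # hard reset after a spike
        spikes.append(s)
    return spikes


if __name__ == "__main__":
    current = [1.0, 1.0, 1.0, 1.0]
    print(lif_forward(current, [0.7] * 4))  # lower threshold
    print(lif_forward(current, [0.9] * 4))  # higher threshold
```

With the same input current, raising the threshold lowers the firing rate, which is the effect the figure captions attribute to the threshold adjustment block.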