SpikeMba: Multi-Modal Spiking Saliency Mamba for Temporal Video Grounding

Wenrui Li, Xiaopeng Hong, Ruiqin Xiong, Xiaopeng Fan

TL;DR

SpikeMba tackles temporal video grounding by marrying Spiking Neural Networks for precise saliency proposal generation with State Space Models for efficient long-range temporal reasoning, augmented by Relevant Slots that encode prior knowledge. The Contextual Moment Reasoner dynamically balances preserving context and exploring semantic relevance, and the Multi-modal Relevant Mamba blocks fuse visual and textual cues with dynamic information propagation. Through loss terms that align proposals, enhance saliency ranking, and regulate entropy, SpikeMba achieves state-of-the-art results across multiple benchmarks, demonstrating improved handling of confidence bias and long-term dependencies. The approach offers a scalable, cross-modal framework with potential impact on fine-grained video understanding tasks and efficient temporal grounding in real-world systems.

Abstract

Temporal video grounding (TVG) is a critical task in video content understanding, requiring precise alignment between video content and natural language instructions. Despite significant advancements, existing methods face challenges in managing confidence bias towards salient objects and capturing long-term dependencies in video sequences. To address these issues, we introduce SpikeMba: a multi-modal spiking saliency mamba for temporal video grounding. Our approach integrates Spiking Neural Networks (SNNs) with state space models (SSMs) to leverage their unique advantages in handling different aspects of the task. Specifically, we use SNNs to develop a spiking saliency detector that generates the proposal set. The detector emits spike signals when the input signal exceeds a predefined threshold, resulting in a dynamic and binary saliency proposal set. To enhance the model's capability to retain and infer contextual information, we introduce relevant slots, which are learnable tensors that encode prior knowledge. These slots work with the contextual moment reasoner to dynamically balance preserving contextual information against exploring semantic relevance. The SSMs facilitate selective information propagation, addressing the challenge of long-term dependency in video content. By combining SNNs for proposal generation and SSMs for effective contextual reasoning, SpikeMba addresses confidence bias and long-term dependencies, thereby significantly enhancing fine-grained multimodal relationship capture. Our experiments demonstrate the effectiveness of SpikeMba, which consistently outperforms state-of-the-art methods across mainstream benchmarks.
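
To make the thresholding idea in the abstract concrete, the following is a minimal sketch of how a spiking unit can turn per-clip saliency scores into a binary proposal set. It is not the paper's detector: the leaky integrate-and-fire dynamics, the hard reset, and the names `spiking_saliency`, `spikes_to_proposals`, `threshold`, and `decay` are all illustrative assumptions.

```python
from typing import List, Sequence


def spiking_saliency(
    scores: Sequence[float],
    threshold: float = 1.0,
    decay: float = 0.5,
) -> List[int]:
    """Return a binary spike train for a sequence of per-clip scores.

    Assumed leaky integrate-and-fire dynamics (illustrative only):
    the membrane potential leaks by `decay`, accumulates the input,
    and emits a spike (1) once it exceeds `threshold`, after which
    it is reset to zero.
    """
    potential = 0.0
    spikes: List[int] = []
    for score in scores:
        potential = decay * potential + score  # leak, then integrate
        if potential > threshold:
            spikes.append(1)  # this clip enters the proposal set
            potential = 0.0   # hard reset after firing
        else:
            spikes.append(0)
    return spikes


def spikes_to_proposals(spikes: Sequence[int]) -> List[int]:
    """Return the indices of clips whose unit fired."""
    return [i for i, s in enumerate(spikes) if s == 1]


# Hypothetical per-clip saliency scores for a six-clip video.
scores = [0.2, 0.3, 1.5, 0.1, 0.9, 0.6]
spikes = spiking_saliency(scores)
proposals = spikes_to_proposals(spikes)
```

With these made-up scores, clip 2 fires because its own input exceeds the threshold, while clip 5 fires only because evidence left over from clip 4 pushes the potential past the threshold. That dependence on recent history is one way to read "dynamic" in the abstract's "dynamic and binary" proposal set.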

Paper Structure

This paper contains 12 sections, 5 equations, 4 figures, 4 tables, and 2 algorithms.

Figures (4)

  • Figure 1: An illustration of the challenges in the video grounding task (left) and the overall architecture of our proposed model (right). On the left, a scenario underscores the problem of confidence bias towards salient objects, with models overemphasizing dramatic changes. On the right, we employ the SNN and Mamba for dynamic proposal generation, encoding prior knowledge, and improving understanding of long-sequence videos.
  • Figure 2: Architectural Overview of the Multi-Modal Spiking Saliency Mamba. The contextual moment reasoner dynamically leverages relevant slots for semantic association and inference. The spiking saliency detector generates a potential proposal set. The multi-modal relevant mamba block enhances long-range dependency modeling while maintaining linear complexity relative to input size.
  • Figure 3: Ablation study over different spiking time steps.
  • Figure 4: Qualitative results of the proposed method.