SpikeMba: Multi-Modal Spiking Saliency Mamba for Temporal Video Grounding
Wenrui Li, Xiaopeng Hong, Ruiqin Xiong, Xiaopeng Fan
TL;DR
SpikeMba tackles temporal video grounding by marrying Spiking Neural Networks for precise saliency proposal generation with State Space Models for efficient long-range temporal reasoning, augmented by Relevant Slots that encode prior knowledge. The Contextual Moment Reasoner dynamically balances preserving context and exploring semantic relevance, and the Multi-modal Relevant Mamba blocks fuse visual and textual cues with dynamic information propagation. Through loss terms that align proposals, enhance saliency ranking, and regulate entropy, SpikeMba achieves state-of-the-art results across multiple benchmarks, demonstrating improved handling of confidence bias and long-term dependencies. The approach offers a scalable, cross-modal framework with potential impact on fine-grained video understanding tasks and efficient temporal grounding in real-world systems.
Abstract
Temporal video grounding (TVG) is a critical task in video content understanding, requiring precise alignment between video content and natural language instructions. Despite significant advancements, existing methods face challenges in managing confidence bias towards salient objects and capturing long-term dependencies in video sequences. To address these issues, we introduce SpikeMba: a multi-modal spiking saliency mamba for temporal video grounding. Our approach integrates Spiking Neural Networks (SNNs) with state space models (SSMs) to leverage their unique advantages in handling different aspects of the task. Specifically, we use SNNs to develop a spiking saliency detector that generates the proposal set. The detector emits spike signals when the input signal exceeds a predefined threshold, resulting in a dynamic and binary saliency proposal set. To enhance the model's capability to retain and infer contextual information, we introduce relevant slots, which are learnable tensors that encode prior knowledge. These slots work with the contextual moment reasoner to dynamically balance preserving contextual information against exploring semantic relevance. The SSMs facilitate selective information propagation, addressing the challenge of long-term dependency in video content. By combining SNNs for proposal generation and SSMs for effective contextual reasoning, SpikeMba addresses confidence bias and long-term dependencies, thereby significantly enhancing fine-grained multimodal relationship capture. Our experiments demonstrate the effectiveness of SpikeMba, which consistently outperforms state-of-the-art methods across mainstream benchmarks.
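To make the thresholded spiking behaviour described above concrete, the following is a minimal sketch rather than the authors' implementation. The abstract only states that the detector emits a spike when its input exceeds a predefined threshold, yielding a binary proposal set. The leaky integrate-and-fire dynamics, the `decay` factor, the hard reset, the per-clip `scores` input, and the helper `spikes_to_proposals` that groups spikes into intervals are all assumptions introduced for illustration.

```python
from typing import List, Tuple


def spiking_saliency(scores: List[float],
                     threshold: float = 1.0,
                     decay: float = 0.5) -> List[int]:
    """Hypothetical leaky integrate-and-fire pass over per-clip scores.

    Returns a binary list: 1 where the membrane potential exceeds
    `threshold` (a spike), 0 otherwise. `decay` and the hard reset are
    illustrative assumptions, not details taken from the paper.
    """
    membrane = 0.0
    spikes = []
    for score in scores:
        # Leaky integration: keep a decayed share of the past potential.
        membrane = decay * membrane + score
        if membrane > threshold:
            spikes.append(1)
            membrane = 0.0  # hard reset after firing
        else:
            spikes.append(0)
    return spikes


def spikes_to_proposals(spikes: List[int]) -> List[Tuple[int, int]]:
    """Group consecutive spiking clips into (start, end) index intervals.

    Both indices are inclusive. This grouping step is an assumption about
    how a binary spike train could be read as a proposal set.
    """
    proposals = []
    start = None
    for index, spike in enumerate(spikes):
        if spike and start is None:
            start = index
        elif not spike and start is not None:
            proposals.append((start, index - 1))
            start = None
    if start is not None:
        proposals.append((start, len(spikes) - 1))
    return proposals


if __name__ == "__main__":
    clip_scores = [0.2, 0.3, 1.5, 1.2, 0.1, 0.9, 0.8]
    train = spiking_saliency(clip_scores)
    print(train)
    print(spikes_to_proposals(train))
```

In this sketch the last clip fires even though its own score (0.8) is below the threshold, because residual potential from the previous clip carries over. That carry-over is one way a spiking detector can produce a dynamic proposal set that depends on temporal context rather than on each clip in isolation.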
