MambaEVT: Event Stream based Visual Object Tracking using State Space Model

Xiao Wang, Chao Wang, Shiao Wang, Xixi Wang, Zhicheng Zhao, Lin Zhu, Bo Jiang

TL;DR

This paper tackles event-camera visual object tracking by addressing two limitations of existing trackers: the high computational cost of Transformer backbones and the rigidity of static templates. It introduces MambaEVT, a pure Mamba-based tracker whose Vision Mamba backbone jointly extracts and fuses features from the static template, the dynamic template, and the search region, while a Memory Mamba module generates adaptive dynamic templates. The dynamic template update draws on long- and short-term template libraries to capture appearance changes, improving robustness while maintaining efficiency. Experiments on EventVOT, VisEvent, and FE240hz show competitive accuracy with substantially fewer parameters than state-of-the-art methods, and ablations validate the benefits of the dynamic template strategy and the backbone design. The work advances event-based tracking by combining linear-complexity sequence modeling with learnable online template updating, and the authors state that the source code will be released for reproducibility and further research.

Abstract

Event camera-based visual tracking has drawn increasing attention in recent years due to its unique imaging principle and its advantages of low energy consumption, high dynamic range, and dense temporal resolution. Current event-based tracking algorithms are gradually hitting performance bottlenecks because they rely on vision Transformers and a static template for target object localization. In this paper, we propose a novel Mamba-based visual tracking framework that adopts a state space model with linear complexity as the backbone network. The search regions and target template are fed into the vision Mamba network for simultaneous feature extraction and interaction. The output tokens of the search regions are then fed into the tracking head for target localization. More importantly, we introduce a dynamic template update strategy into the tracking framework using the Memory Mamba network. By considering the diversity of samples in the target template library and making appropriate adjustments to the template memory module, a more effective dynamic template can be integrated. The combination of dynamic and static templates allows our Mamba-based tracking algorithm to achieve a good balance between accuracy and computational cost on multiple large-scale datasets, including EventVOT, VisEvent, and FE240hz. The source code will be released at https://github.com/Event-AHU/MambaEVT
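
To make the "linear complexity" claim concrete, the following is a minimal sketch of a plain discrete state space recurrence, not the authors' implementation. It assumes a diagonal, time-invariant state matrix and a scalar input per step; the function name `ssm_scan` and the parameter names `a`, `b`, `c` are illustrative only. Mamba additionally makes its parameters depend on the input and uses a hardware-aware scan, neither of which is shown here. The point is only that one pass over the sequence costs time proportional to the sequence length, whereas self-attention compares every token with every other token.

```python
from typing import List, Sequence


def ssm_scan(
    x: Sequence[float],
    a: Sequence[float],
    b: Sequence[float],
    c: Sequence[float],
) -> List[float]:
    """Run a diagonal, time-invariant state space recurrence over a 1-D input.

    h_t = a * h_{t-1} + b * x_t   (element-wise, one entry per state dimension)
    y_t = sum(c * h_t)

    Illustrative assumption: a, b, c are fixed vectors of equal length.
    """
    if not (len(a) == len(b) == len(c)):
        raise ValueError("a, b and c must have the same length")

    h = [0.0] * len(a)  # hidden state starts at zero
    y: List[float] = []
    for x_t in x:  # a single pass: cost grows linearly with len(x)
        h = [a_i * h_i + b_i * x_t for a_i, h_i, b_i in zip(a, h, b)]
        y.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return y


if __name__ == "__main__":
    # An impulse input decays geometrically through a one-dimensional state.
    print(ssm_scan([1.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[1.0]))  # [1.0, 0.5, 0.25]
```

Each step touches only the fixed-size hidden state, so doubling the number of tokens doubles the work, which is the property the abstract relies on when it contrasts the backbone with vision Transformers.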

Paper Structure (20 sections, 8 equations, 5 figures, 10 tables)

Figures (5)

  • Figure 1: Comparison between our tracker and existing SOTA trackers in terms of parameters, accuracy, and GPU memory consumption (GB) on the EventVOT dataset. Bubble size represents GPU memory usage; larger bubbles indicate higher consumption.
  • Figure 2: An overview of our proposed pure Mamba-based visual object tracker for event cameras, termed MambaEVT. The first key feature of our tracking framework is the vision Mamba-based backbone network, which performs feature extraction, interaction, and fusion simultaneously. This allows our tracker to achieve good performance at a lower computational cost. The second is the learnable Memory Mamba for dynamic template generation, which makes our tracker more robust to significant appearance variation.
  • Figure 3: Tracking results under various challenging factors.
  • Figure 4: Activation maps predicted by our proposed MambaEVT framework.
  • Figure 5: Visualization of the tracking results of MambaEVT and other SOTA trackers.