MambaVLT: Time-Evolving Multimodal State Space Model for Vision-Language Tracking

Xinqi Liu, Li Zhou, Zikun Zhou, Jianqiu Chen, Zhenyu He

TL;DR

MambaVLT is a Mamba-based vision-language tracking model that exploits Mamba's state space evolution along the temporal dimension for robust multimodal tracking. It integrates a time-evolving hybrid state space block and a selective locality enhancement block to capture contextual information for multimodal modeling and to update reference features adaptively.

Abstract

The vision-language tracking task aims to perform object tracking based on references from various modalities. Existing Transformer-based vision-language tracking methods have made remarkable progress by leveraging the global modeling ability of self-attention. However, current approaches still face challenges in effectively exploiting temporal information and dynamically updating reference features during tracking. Recently, the State Space Model (SSM) known as Mamba has shown astonishing ability in efficient long-sequence modeling. In particular, its state space evolving process demonstrates promising capabilities in memorizing multimodal temporal information with linear complexity. Witnessing its success, we propose a Mamba-based vision-language tracking model, dubbed MambaVLT, that exploits this state space evolving ability in temporal space for robust multimodal tracking. Our approach integrates a time-evolving hybrid state space block and a selective locality enhancement block to capture contextual information for multimodal modeling and adaptive reference feature update. In addition, we introduce a modality-selection module that dynamically adjusts the weighting between visual and language references, mitigating potential ambiguities from either reference type. Extensive experimental results show that our method performs favorably against state-of-the-art trackers across diverse benchmarks.
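
The idea of a state that evolves over time with linear cost can be illustrated with a toy recurrence. The sketch below is not the MambaVLT implementation: it assumes a scalar state, fixed coefficients `a`, `b`, `c` (in Mamba these are computed from the input), and hypothetical helper names `ssm_scan` and `track_frames`. It only shows that handing the final state of one frame to the next frame as its initial state gives the same result as scanning all tokens as one long sequence.

```python
def ssm_scan(xs, h_init, a=0.9, b=0.5, c=1.0):
    """Run a scalar linear state-space recurrence over one token sequence.

    h_k = a * h_{k-1} + b * x_k,   y_k = c * h_k

    a, b, c are fixed illustrative values (an assumption of this sketch);
    Mamba derives its parameters from the input, which is omitted here.
    """
    h = h_init
    ys = []
    for x in xs:              # one update per token: cost is linear in length
        h = a * h + b * x
        ys.append(c * h)
    return ys, h              # outputs and the final state


def track_frames(frames, h0=0.0):
    """Process frames in order, carrying the state across frame boundaries."""
    h = h0
    outputs = []
    for tokens in frames:
        ys, h = ssm_scan(tokens, h)   # final state becomes next initial state
        outputs.append(ys)
    return outputs, h


# Three "frames" with made-up token values.
frames = [[1.0, 2.0], [3.0], [4.0, 5.0]]
outputs, h_final = track_frames(frames)

# Scanning the concatenated tokens in one pass reaches the same final state.
flat_tokens = [x for tokens in frames for x in tokens]
ys_flat, h_flat = ssm_scan(flat_tokens, 0.0)
```

Because the state is the only thing passed between frames, memory per step stays constant no matter how many frames have been seen.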

Paper Structure

This paper contains 24 sections, 10 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Illustration of two ways for capturing temporal context information. (a) Vision-language tracker with discrete context prompt. (b) Our MambaVLT with continuous time-evolving state space for temporal information transmission.
  • Figure 2: Overview of MambaVLT. Given various modality reference settings, features are first extracted and aligned, then forwarded to the time-evolving multimodal fusion module. The fused features are then passed to the localization module to obtain precise localization information. MambaVLT performs temporal information-aware vision-language tracking with adaptive reference feature updating. 'NA' indicates that the corresponding reference is not provided.
  • Figure 3: Overall pipeline of the Hybrid Multimodal State Space Block. The multimodal feature includes the language feature $F_l$, the template feature $F_z$, and the search region feature $F_X$. The Hybrid Multimodal State Space block performs time-evolving global modeling and reference feature updating. The Selective Locality Enhancement block then enhances the features of the current tracking frame. $\mathbf{H}^{ini}_{t}$ and $\mathbf{H}^{fin}_{t}$ denote the initial and final state spaces. "Local scan" denotes the linear attention scan, and $\boldsymbol{A_l}$ denotes the global selective map.
  • Figure 4: Overview of the modality-selection module. $w_l$ and $w_z$ represent the weights of the language invariant clue $P_l$ and the template invariant clue $P_z$.
  • Figure 5: Qualitative comparison on the NL&BBOX tracking task for two challenging sequences, used to analyze the effectiveness of the state space. The line graphs show the per-frame IoU of different trackers. SRF denotes the semi-reference-free tracking setting.
  • ...and 4 more figures
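
The weighting described in the Figure 4 caption can be pictured with a small sketch. The source does not state how $w_l$ and $w_z$ are computed, so everything below is an assumption for illustration: the scores `s_l` and `s_z`, the softmax normalization, the weighted-sum fusion, and the function names are all hypothetical and not taken from the paper.

```python
import math


def softmax2(s_l, s_z):
    """Turn two raw scores into two positive weights that sum to 1."""
    m = max(s_l, s_z)                    # subtract the max for stability
    e_l, e_z = math.exp(s_l - m), math.exp(s_z - m)
    total = e_l + e_z
    return e_l / total, e_z / total


def fuse_clues(p_l, p_z, s_l, s_z):
    """Blend a language clue p_l and a template clue p_z element-wise.

    s_l and s_z are hypothetical confidence scores; how the actual module
    scores each modality is not specified in the source.
    """
    w_l, w_z = softmax2(s_l, s_z)
    fused = [w_l * a + w_z * b for a, b in zip(p_l, p_z)]
    return fused, (w_l, w_z)


# Equal scores give equal weights, so the result is the plain average.
fused_equal, weights_equal = fuse_clues([1.0, 0.0], [0.0, 1.0], 0.0, 0.0)

# A much higher language score pushes the result toward the language clue.
fused_lang, weights_lang = fuse_clues([1.0, 0.0], [0.0, 1.0], 5.0, 0.0)
```

Under these assumptions, an ambiguous reference would receive a low score and contribute little to the fused clue, which is the behavior the caption's weights suggest.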