MambaLCT: Boosting Tracking via Long-term Context State Space Model

Xiaohai Li, Bineng Zhong, Qihua Liang, Guorong Li, Zhiyi Mo, Shuxiang Song

TL;DR

The paper tackles the limitation of short-term context in visual tracking by introducing MambaLCT, a framework that builds long-term target variation cues from the first frame to the current frame using a unidirectional Context Mamba module. It unifies context and appearance modeling via a ucaEncoder and cross-frame tokens, with target information accumulated in hidden states $H_i^t$ and transmitted as $Y_i^T$ to guide frame-to-frame similarity. The training objective combines $L_{cls}$, $L_1$, and $L_{GIoU}$ as $L = L_{cls} + \lambda_1 L_1 + \lambda_2 L_{GIoU}$, and the method leverages a HiViT backbone with a Vim-Small Mamba, achieving state-of-the-art performance on six benchmarks including LaSOT, LaSOT$_{ext}$, GOT-10K, TrackingNet, TNL2K, and UAV123, while maintaining real-time speeds. This long-term context integration improves robustness in challenging scenarios such as occlusion and deformation, offering a practical advance for long-duration tracking tasks.
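
To make the combined objective $L = L_{cls} + \lambda_1 L_1 + \lambda_2 L_{GIoU}$ concrete, here is a minimal sketch for a single predicted box. It is not the authors' implementation: the function names, the `(x1, y1, x2, y2)` box format, the mean-absolute-error form of $L_1$, and the default weights are all assumptions made for illustration, and the classification loss is simply passed in as a number because the summary does not say how it is computed.

```python
def giou_loss(pred, gt):
    """Return 1 - GIoU for two axis-aligned boxes given as (x1, y1, x2, y2).

    Assumes both boxes have positive area.
    """
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    area_pred = (px2 - px1) * (py2 - py1)
    area_gt = (gx2 - gx1) * (gy2 - gy1)

    # Intersection (zero when the boxes do not overlap).
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    union = area_pred + area_gt - inter

    # Smallest axis-aligned box enclosing both inputs.
    enclose = (max(px2, gx2) - min(px1, gx1)) * (max(py2, gy2) - min(py1, gy1))

    giou = inter / union - (enclose - union) / enclose
    return 1.0 - giou


def l1_loss(pred, gt):
    """Mean absolute difference over the four box coordinates (assumed form)."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(pred)


def total_loss(l_cls, pred, gt, lambda_1=5.0, lambda_2=2.0):
    """L = L_cls + lambda_1 * L_1 + lambda_2 * L_GIoU.

    lambda_1 and lambda_2 are placeholder defaults; the summary does not
    state the values used in the paper.
    """
    return l_cls + lambda_1 * l1_loss(pred, gt) + lambda_2 * giou_loss(pred, gt)
```

When the prediction matches the ground truth exactly, both regression terms vanish and `total_loss` reduces to the classification term.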

Abstract

Effectively constructing context information with long-term dependencies from video sequences is crucial for object tracking. However, the context length constructed by existing work is limited, only considering object information from adjacent frames or video clips, leading to insufficient utilization of contextual information. To address this issue, we propose MambaLCT, which constructs and utilizes target variation cues from the first frame to the current frame for robust tracking. First, a novel unidirectional Context Mamba module is designed to scan frame features along the temporal dimension, gathering target change cues throughout the entire sequence. Specifically, target-related information in frame features is compressed into a hidden state space through a selective scanning mechanism. The target information across the entire video is continuously aggregated into target variation cues. Next, we inject the target change cues into the attention mechanism, providing temporal information for modeling the relationship between the template and search frames. The advantage of MambaLCT is its ability to continuously extend the length of the context, capturing complete target change cues, which enhances the stability and robustness of the tracker. Extensive experiments show that long-term context information enhances the model's ability to perceive targets in complex scenarios. MambaLCT achieves new SOTA performance on six benchmarks while maintaining real-time running speeds.
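
The idea of scanning frame features in one temporal direction and compressing them into a hidden state can be illustrated with a toy selective state-space recurrence. The sketch below is a generic, simplified Mamba-style scan written in plain Python, not the paper's Context Mamba module: the function `selective_scan`, its parameters (`w_delta`, `a_diag`, `w_b`, `w_c`), and the way the step size and the input/output projections are derived from each frame are all assumptions chosen for readability.

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z)); keeps the step size positive."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def selective_scan(frames, w_delta, a_diag, w_b, w_c):
    """Causal (first-to-current) scan over per-frame feature vectors.

    frames:   list of T feature vectors, each of length D
    w_delta:  length-D weights giving an input-dependent step size per channel
    a_diag:   length-N negative decay rates (a diagonal state matrix)
    w_b, w_c: D x N weight matrices producing input-dependent B_t and C_t
    Returns (outputs, state): outputs is T x D, state is the final D x N state.
    """
    d, n = len(w_delta), len(a_diag)
    state = [[0.0] * n for _ in range(d)]
    outputs = []
    for x in frames:
        # Input-dependent ("selective") projections for this frame.
        b_t = [sum(x[i] * w_b[i][j] for i in range(d)) for j in range(n)]
        c_t = [sum(x[i] * w_c[i][j] for i in range(d)) for j in range(n)]
        y = []
        for i in range(d):
            delta = softplus(w_delta[i] * x[i])
            for j in range(n):
                # Decay the old state, then write the current frame into it.
                decay = math.exp(delta * a_diag[j])
                state[i][j] = decay * state[i][j] + delta * b_t[j] * x[i]
            # Read out a cue for this frame from the accumulated state.
            y.append(sum(state[i][j] * c_t[j] for j in range(n)))
        outputs.append(y)
    return outputs, state
```

Because the state is only ever updated from earlier frames, the output at frame t depends on frames 1 through t and never on later ones, and the memory cost stays fixed no matter how long the sequence grows.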

Paper Structure

This paper contains 15 sections, 10 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Comparison between current SOT context information construction paradigms and our method. (a) and (b) construct short-term context information within the scope of frames or video clips (EVPTrack, ODTrack). (c) Our proposed MambaLCT analyzes the complete video sequence to construct long-term context information.
  • Figure 2: Overview of our framework. The input video frames are converted into tokens through patch embedding. Then, these tokens, along with the contextual information, are fed into the ucaEncoder for unified modeling of the contextual and appearance information. During the temporal scanning process, the representational information of the images is continuously fed into the Context Mamba module to construct the target's change cues.
  • Figure 3: Illustration of the process of constructing and propagating context information. On the left is the structure of the ucaEncoder, and on the right is the process of constructing contextual information.
  • Figure 4: Attribute-based evaluation on the LaSOT test set. AUC score is used to rank different trackers.
  • Figure 5: On the LaSOT benchmark, we visualized the comparison results of our tracker with three SOTA trackers across three challenges.
  • ...and 1 more figures