MambaVT: Spatio-Temporal Contextual Modeling for robust RGB-T Tracking

Simiao Lai, Chang Liu, Jiawen Zhu, Ben Kang, Yang Liu, Dong Wang, Huchuan Lu

TL;DR

This paper tackles robust RGB-T tracking by addressing two weaknesses of attention-based transformers: limited long-range temporal modeling and quadratic complexity. It introduces MambaVT, a pure Mamba-based framework whose sequence modeling cost is linear, $O(L)$, in the sequence length $L$, and which combines long-range cross-frame integration with short-term historical trajectory prompts in a single-stream architecture. The approach uses video-level RGB/TIR templates and bidirectional Mamba encoders to model global spatio-temporal context, complemented by trajectory-based local motion cues and a simple online template memory strategy. Results on four RGB-T benchmarks show state-of-the-art performance with real-time efficiency, supporting the combination of global and local context under linear complexity. The work provides a strong, lightweight baseline for multi-modal tracking and suggests directions for future improvements in both software and hardware optimization.

Abstract

Existing RGB-T tracking algorithms have made remarkable progress by leveraging the global interaction capability and extensive pre-trained models of the Transformer architecture. Nonetheless, these methods mainly adopt image-pair appearance matching and are burdened by the intrinsic quadratic complexity of the attention mechanism, which constrains their exploitation of temporal information. Inspired by the recently emerged State Space Model Mamba, renowned for its impressive long-sequence modeling capabilities and linear computational complexity, this work proposes a pure Mamba-based framework (MambaVT) to fully exploit spatio-temporal contextual modeling for robust visible-thermal tracking. Specifically, we devise a long-range cross-frame integration component to globally adapt to target appearance variations, and introduce short-term historical trajectory prompts to predict subsequent target states from local temporal location cues. Extensive experiments show the significant potential of vision Mamba for RGB-T tracking, with MambaVT achieving state-of-the-art performance on four mainstream benchmarks while requiring lower computational costs. We aim for this work to serve as a simple yet strong baseline, stimulating future research in this field. The code and pre-trained models will be made available.
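
To make the "linear computational complexity" and "bidirectional" ideas concrete, the following is a minimal sketch of a bidirectional linear recurrence over a token sequence. It is not the MambaVT or Mamba implementation: the scalar parameters `a`, `b`, `c` and the function names are hypothetical, and the recurrence is a fixed (non-selective) state-space update, whereas Mamba makes its parameters input-dependent. The point is only that each direction visits every token once, so the cost grows linearly with the sequence length.

```python
from typing import List


def linear_scan(xs: List[float], a: float, b: float, c: float) -> List[float]:
    """Single left-to-right pass of h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.

    One state update per token, so the cost is O(L) for L tokens.
    The scalars a, b, c are illustrative assumptions, not learned parameters.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x  # carry a running summary of everything seen so far
        ys.append(c * h)
    return ys


def bidirectional_scan(xs: List[float], a: float = 0.5, b: float = 1.0,
                       c: float = 1.0) -> List[float]:
    """Sum a forward scan and a backward scan so each token sees both sides."""
    forward = linear_scan(xs, a, b, c)
    # Scan the reversed sequence, then flip the outputs back into place.
    backward = linear_scan(xs[::-1], a, b, c)[::-1]
    return [f + g for f, g in zip(forward, backward)]


if __name__ == "__main__":
    # An impulse at the first token decays as it propagates forward.
    print(bidirectional_scan([1.0, 0.0, 0.0]))
```

With two passes over the sequence, the bidirectional variant is still linear in the number of tokens; only the constant factor doubles.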

Paper Structure

This paper contains 14 sections, 8 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Framework and efficiency comparisons of our proposed method. (a) Unlike existing RGB-T tracking methods, which usually take a single image pair as input and overlook temporal information, our framework uses vision Mamba to fully exploit spatio-temporal context through long-range appearance modeling and short-term motion modeling. (b) The FLOPs and GPU memory usage of transformer-based methods grow quadratically with the number of frames and quickly become prohibitive. In contrast, our Mamba-based method scales linearly, making it efficient. Note that the batch size is 1.
  • Figure 2: Overall framework of MambaVT. The video-level template set and the search region training samples are fed into the patch embedding layer, and the historical trajectory prompts are fed into the coordinate embedding layer. All embedded vectors are sent into the bidirectional Mamba encoder for unified contextual modeling. Finally, the search region vectors are used to predict the object state, and the coordinate query vector is used for auxiliary supervision against the ground-truth bounding box.
  • Figure 3: Various data input modes. (a) Concatenation variants of template and search region vectors. (b) Different scan orientations of the scan operator in the SSM. "t", "s" and "f" refer to template, search region and frame, respectively.
  • Figure 4: Qualitative comparison before vs. after incorporating trajectory motion information.
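
The scaling contrast described in the Figure 1(b) caption can be illustrated with a rough operation count. This is a back-of-the-envelope sketch, not the paper's measured FLOPs or memory figures: the function names are hypothetical, `tokens_per_frame = 256` is an assumed value, and hidden dimensions and constant factors are ignored so that only the dependence on sequence length remains.

```python
def token_count(num_frames: int, tokens_per_frame: int) -> int:
    """Total sequence length when all frames are concatenated into one stream."""
    return num_frames * tokens_per_frame


def attention_pair_count(seq_len: int) -> int:
    """Self-attention compares every token with every token: L * L pairs."""
    return seq_len * seq_len


def scan_step_count(seq_len: int) -> int:
    """A recurrent state-space scan updates its state once per token: L steps."""
    return seq_len


if __name__ == "__main__":
    tokens_per_frame = 256  # assumed value, for illustration only
    for frames in (1, 2, 4, 8):
        seq_len = token_count(frames, tokens_per_frame)
        print(frames, attention_pair_count(seq_len), scan_step_count(seq_len))
```

Under these assumptions, doubling the number of frames quadruples the attention pair count but only doubles the scan step count, which matches the quadratic versus linear trends the caption describes.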