Exploiting Multimodal Spatial-temporal Patterns for Video Object Tracking

Xiantao Hu, Ying Tai, Xu Zhao, Chen Zhao, Zhenyu Zhang, Jun Li, Bineng Zhong, Jian Yang

TL;DR

STTrack introduces a temporal state generator (TSG) that continuously generates a sequence of tokens carrying multimodal temporal information. These tokens guide the localization of the target in the next time state, establish long-range contextual relationships between video frames, and capture the temporal trajectory of the target.

Abstract

Multimodal tracking has garnered widespread attention because it effectively addresses the inherent limitations of traditional RGB tracking. However, existing multimodal trackers mainly focus on the fusion and enhancement of spatial features, or merely leverage sparse temporal relationships between video frames. These approaches do not fully exploit the temporal correlations in multimodal videos, making it difficult to capture the dynamic changes and motion information of targets in complex scenarios. To alleviate this problem, we propose a unified multimodal spatial-temporal tracking approach named STTrack. In contrast to previous paradigms that rely solely on updating reference information, we introduce a temporal state generator (TSG) that continuously generates a sequence of tokens containing multimodal temporal information. These temporal information tokens are used to guide the localization of the target in the next time state, establish long-range contextual relationships between video frames, and capture the temporal trajectory of the target. Furthermore, at the spatial level, we introduce the Mamba fusion and background suppression interactive (BSI) modules. These modules establish a dual-stage mechanism for coordinating information interaction and fusion between modalities. Extensive comparisons on five benchmark datasets illustrate that STTrack achieves state-of-the-art performance across various multimodal tracking scenarios. Code is available at: https://github.com/NJU-PCALab/STTrack.

Paper Structure

This paper contains 17 sections, 10 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Illustrations of different frameworks of multimodal trackers (a)-(c), and performance comparison (d). (a) An offline multimodal tracker tracks video sequences using fixed template frames. (b) An online multimodal tracker follows an updating strategy, which updates the reference information based on the tracking results. (c) Our proposed STTrack transmits multimodal temporal information throughout the tracking process. (d) STTrack achieves superior performance against recent state-of-the-art competitors on three popular multimodal tasks.
  • Figure 2: Overall architecture of STTrack. The temporal information tokens of each modality, along with the image tokens, are fed into the vision encoder so that temporal information guides the extraction of current features. In our Temporal State Generator, the current temporal tokens are generated from cross-modal features and previous temporal features. Cross-modal interaction is added in the vision encoder. Finally, the features are finely adjusted and fused through the Mamba fusion module and then fed into the tracking head to predict the current state.
  • Figure 3: Left: Architecture of the background suppression interactive (BSI) module. Right: Details of the Mamba fusion module. In the BSI module, S denotes the search-area tokens, Z the template tokens, and T the temporal information tokens.
  • Figure 4: Qualitative comparison between our method and other unified multimodal trackers on three multimodal tasks. The three sequences correspond to scenarios involving similar object interference, fast motion, and target deformation. Our tracker effectively addresses these challenges through dual optimization in both the temporal and spatial dimensions.
  • Figure 5: Comparison of STTrack and SOTA trackers (including unified trackers and RGB-T trackers) under different attributes in the LasHeR dataset.
  • ...and 3 more figures
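
The data flow described in the Figure 2 caption (temporal tokens guide feature extraction for the current frame, then are regenerated from cross-modal features and the previous temporal tokens) can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions, not the STTrack implementation: plain dot-product attention stands in for the learned encoder and TSG blocks, a single shared set of temporal tokens replaces the per-modality tokens, and the names `encode`, `generate_temporal_state`, and `run_sequence` are hypothetical. The actual design is in the authors' repository.

```python
import numpy as np

def attention(queries, keys, values):
    """Plain scaled dot-product attention (a stand-in, not the paper's learned blocks)."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values

def encode(temporal_tokens, image_tokens):
    """Hypothetical encoder step: image tokens attend to themselves plus the temporal tokens."""
    context = np.concatenate([temporal_tokens, image_tokens], axis=0)
    return image_tokens + attention(image_tokens, context, context)

def generate_temporal_state(prev_tokens, rgb_features, aux_features):
    """Hypothetical TSG step: previous temporal tokens query the cross-modal features."""
    cross_modal = np.concatenate([rgb_features, aux_features], axis=0)
    return prev_tokens + attention(prev_tokens, cross_modal, cross_modal)

def run_sequence(rgb_frames, aux_frames, num_temporal_tokens=4):
    """Carry temporal tokens from frame to frame and return their state after each frame."""
    dim = rgb_frames[0].shape[-1]
    temporal = np.zeros((num_temporal_tokens, dim))  # assumed zero initial state
    history = []
    for rgb, aux in zip(rgb_frames, aux_frames):
        rgb_feat = encode(temporal, rgb)   # temporal tokens guide the RGB features
        aux_feat = encode(temporal, aux)   # ...and the auxiliary-modality features
        temporal = generate_temporal_state(temporal, rgb_feat, aux_feat)
        history.append(temporal.copy())
    return history

# Toy data: 3 frames, 16 tokens per modality, 8-dimensional features.
rng = np.random.default_rng(0)
rgb_frames = [rng.normal(size=(16, 8)) for _ in range(3)]
aux_frames = [rng.normal(size=(16, 8)) for _ in range(3)]
history = run_sequence(rgb_frames, aux_frames)
```

The point of the sketch is the recurrence: the tokens produced for one frame are the input state for the next, so information can propagate across the whole sequence rather than only through template updates.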