CSTrack: Enhancing RGB-X Tracking via Compact Spatiotemporal Features

X. Feng, D. Zhang, S. Hu, X. Li, M. Wu, J. Zhang, X. Chen, K. Huang

TL;DR

CSTrack tackles RGB-X tracking by moving away from dual-branch, dispersed feature spaces toward compact spatiotemporal features in a single-branch pipeline. It introduces the Spatial Compact Module (SCM) to fuse RGB and X into a compact spatial representation and the Temporal Compact Module (TCM) to densely capture temporal cues via a target distribution heatmap and a temporal memory. The method achieves state-of-the-art performance on RGB-D/T/E benchmarks and demonstrates strong robustness across challenging scenarios, while maintaining efficiency through token compression and a one-stream backbone. This compact approach offers a scalable, cross-modal tracking framework with potential extensions to additional modalities and future visual-language integrations.

Abstract

Effectively modeling and utilizing spatiotemporal features from RGB and other modalities (e.g., depth, thermal, and event data, denoted as X) is the core of RGB-X tracker design. Existing methods often employ two parallel branches to separately process the RGB and X input streams, requiring the model to simultaneously handle two dispersed feature spaces, which complicates both the model structure and computation process. More critically, intra-modality spatial modeling within each dispersed space incurs substantial computational overhead, limiting resources for inter-modality spatial modeling and temporal modeling. To address this, we propose a novel tracker, CSTrack, which focuses on modeling Compact Spatiotemporal features to achieve simple yet effective tracking. Specifically, we first introduce an innovative Spatial Compact Module that integrates the RGB-X dual input streams into a compact spatial feature, enabling thorough intra- and inter-modality spatial modeling. Additionally, we design an efficient Temporal Compact Module that compactly represents temporal features by constructing the refined target distribution heatmap. Extensive experiments validate the effectiveness of our compact spatiotemporal modeling method, with CSTrack achieving new SOTA results on mainstream RGB-X benchmarks. The code and models will be released at: https://github.com/XiaokunFeng/CSTrack.

Paper Structure

This paper contains 46 sections, 17 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Framework of our proposed CSTrack. Given the RGB and X (e.g., thermal data) input streams at time $t$ ($t \geq 1$), the shared Patch Embedding initially transforms them into token sequences. Then, the Spatial Compact Module integrates them into a compact feature space, which is subsequently fed into a One-stream Backbone for comprehensive spatial modeling. Next, the Temporal Guidance Module uses the previously stored temporal features (up to $t - 1$) for tracking guidance, after which the Head generates the final tracking results. Subsequently, the Temporal Compact Module constructs compact temporal features for the current time step, which are stored for tracking guidance at the next time step $t + 1$.
  • Figure 2: Illustration of different target heatmaps (using an RGB-T sample as an example). (a) Search images and templates, with the green bounding box indicating the target to be tracked. (b-d) Target distribution heatmaps derived from intermediate results, final results, and their combination, i.e., $h^t_i$, $h^t_f$, and $h^t$ (reshaped into 2D images for visualization).
  • Figure 3: Qualitative comparison of our tracker with two other trackers (i.e., UNTrack and SDSTrack) on three challenging cases. Better viewed in color with zoom-in.
  • Figure 4: Tracking results of the model under different input settings in two categories of cases (using the RGB-T task as an example). (a) Search images and templates. (b-d) Tracking results with only RGB input, only X input, and both inputs. The heatmap regions are cropped by the tracker. The green and red bounding boxes represent the target to be tracked and the tracking result. Better viewed with zoom-in.
  • Figure 5: Model performance variation ($\Delta$ precision) with different temporal lengths.
  • ...and 2 more figures
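
The per-frame data flow described in the Figure 1 caption can be sketched as a plain control loop. This is only an illustrative skeleton, not the authors' implementation: every module is passed in as a hypothetical callable, the toy stand-ins below (element-wise averaging for the Spatial Compact Module, a sum for the Head) are placeholders chosen for this example, and the inputs given to the Temporal Compact Module as well as the `memory_len` parameter are assumptions. What the sketch does take from the caption is the order of operations: guidance at time t sees only features stored up to t - 1, and the compact temporal feature for t is stored afterwards for use at t + 1.

```python
from collections import deque


def track_sequence(rgb_frames, x_frames, *, patch_embed, spatial_compact,
                   backbone, temporal_guidance, head, temporal_compact,
                   memory_len=4):
    """Run the Figure 1 data flow over a sequence of paired RGB / X frames.

    All module arguments are hypothetical callables supplied by the caller.
    `memory_len` (how many past temporal features are kept) is an assumption.
    """
    memory = deque(maxlen=memory_len)  # temporal features from steps < t
    results = []
    for rgb, x in zip(rgb_frames, x_frames):
        # The same (shared) patch embedding tokenizes both input streams.
        rgb_tokens = patch_embed(rgb)
        x_tokens = patch_embed(x)
        # Spatial Compact Module: two token streams -> one compact feature.
        compact = spatial_compact(rgb_tokens, x_tokens)
        # One-stream backbone models the compact spatial feature.
        features = backbone(compact)
        # Temporal guidance uses only what was stored before this step.
        guided = temporal_guidance(features, list(memory))
        result = head(guided)
        results.append(result)
        # Temporal Compact Module: build this step's temporal feature and
        # store it for step t + 1 (its inputs here are an assumption).
        memory.append(temporal_compact(features, result))
    return results


# Toy stand-ins so the loop can be executed end to end.
seen_memory_sizes = []


def toy_guidance(features, memory):
    seen_memory_sizes.append(len(memory))  # record how much history is visible
    return features


results = track_sequence(
    rgb_frames=[[1.0, 3.0]] * 4,
    x_frames=[[3.0, 1.0]] * 4,
    patch_embed=list,
    spatial_compact=lambda a, b: [(p + q) / 2 for p, q in zip(a, b)],
    backbone=lambda tokens: tokens,
    temporal_guidance=toy_guidance,
    head=sum,
    temporal_compact=lambda features, result: result,
    memory_len=2,
)
print(results)
print(seen_memory_sizes)
```

With `memory_len=2`, the recorded history sizes start at zero on the first frame and stop growing once the bounded memory is full, which mirrors the "store at t, use at t + 1" ordering in the caption.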