CSTrack: Enhancing RGB-X Tracking via Compact Spatiotemporal Features
X. Feng, D. Zhang, S. Hu, X. Li, M. Wu, J. Zhang, X. Chen, K. Huang
TL;DR
CSTrack tackles RGB-X tracking by moving away from dual-branch, dispersed feature spaces toward compact spatiotemporal features in a single-branch pipeline. It introduces the Spatial Compact Module (SCM) to fuse RGB and X into a compact spatial representation and the Temporal Compact Module (TCM) to densely capture temporal cues via a target distribution heatmap and a temporal memory. The method achieves state-of-the-art performance on RGB-D/T/E benchmarks and demonstrates strong robustness across challenging scenarios, while maintaining efficiency through token compression and a one-stream backbone. This compact approach offers a scalable, cross-modal tracking framework with potential extensions to additional modalities and future visual-language integrations.
Abstract
Effectively modeling and utilizing spatiotemporal features from RGB and other modalities (e.g., depth, thermal, and event data, denoted as X) is the core of RGB-X tracker design. Existing methods often employ two parallel branches to separately process the RGB and X input streams, requiring the model to simultaneously handle two dispersed feature spaces, which complicates both the model structure and computation process. More critically, intra-modality spatial modeling within each dispersed space incurs substantial computational overhead, limiting resources for inter-modality spatial modeling and temporal modeling. To address this, we propose a novel tracker, CSTrack, which focuses on modeling Compact Spatiotemporal features to achieve simple yet effective tracking. Specifically, we first introduce an innovative Spatial Compact Module that integrates the RGB-X dual input streams into a compact spatial feature, enabling thorough intra- and inter-modality spatial modeling. Additionally, we design an efficient Temporal Compact Module that compactly represents temporal features by constructing the refined target distribution heatmap. Extensive experiments validate the effectiveness of our compact spatiotemporal modeling method, with CSTrack achieving new SOTA results on mainstream RGB-X benchmarks. The code and models will be released at: https://github.com/XiaokunFeng/CSTrack.
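To make the overhead argument concrete, the following is a minimal sketch, not the authors' implementation. It assumes both modalities are tokenized into the same number of tokens, and it stands in for the Spatial Compact Module with a plain element-wise sum; the abstract does not specify the fusion operator, so `fuse_tokens` and `attention_pairs` are hypothetical names introduced only for illustration. The sketch counts token pairs that self-attention would have to score in a dual-branch layout versus a single compact feature.

```python
import numpy as np


def fuse_tokens(rgb_tokens: np.ndarray, x_tokens: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for spatial fusion: element-wise sum.

    Assumption: both inputs have shape (num_tokens, dim). The real module
    may use a different operator; this only keeps the token count fixed.
    """
    if rgb_tokens.shape != x_tokens.shape:
        raise ValueError("RGB and X token arrays must share a shape")
    return rgb_tokens + x_tokens


def attention_pairs(num_tokens: int) -> int:
    """Number of query-key pairs scored by one self-attention layer."""
    return num_tokens * num_tokens


rng = np.random.default_rng(0)
n_tokens, dim = 256, 64  # illustrative sizes, not taken from the paper

rgb = rng.standard_normal((n_tokens, dim))
x = rng.standard_normal((n_tokens, dim))

# Dual-branch: each modality attends within its own feature space.
dual_cost = 2 * attention_pairs(n_tokens)

# Compact: one fused feature with the same token count as a single modality.
compact = fuse_tokens(rgb, x)
compact_cost = attention_pairs(compact.shape[0])

print(f"dual-branch pairs: {dual_cost}, compact pairs: {compact_cost}")
```

Under these assumptions the dual-branch layout scores twice as many intra-modality pairs per layer before any cross-modal interaction is added, which is the budget the compact representation is meant to free up for inter-modality and temporal modeling.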
