CADTrack: Learning Contextual Aggregation with Deformable Alignment for Robust RGBT Tracking

Hao Li, Yuhao Wang, Xiantao Hu, Wenning Hao, Pingping Zhang, Dong Wang, Huchuan Lu

TL;DR

CADTrack tackles modality discrepancies in RGBT tracking with three components: Mamba-based Feature Interaction (MFI), which performs cross-modal interaction through state space models at linear complexity; the Contextual Aggregation Module (CAM), which aggregates cross-layer features dynamically and sparsely via Mixture-of-Experts gating; and the Deformable Alignment Module (DAM), which combines deformable sampling with temporal propagation to mitigate spatial misalignment and localization drift. Evaluated on five benchmarks, the approach is reported to deliver state-of-the-art results while maintaining real-time efficiency. The findings point to the practical potential of dynamic, modality-aware feature fusion and alignment for all-weather, all-day tracking applications.
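To make the "linear-complexity interaction" idea concrete, here is a toy sketch of a state-space style scan over tokens from two modalities. It is not the paper's MFI: the scalar tokens, the fixed parameters `a`, `b`, `c`, the interleaving order, and the function names `ssm_scan` and `cross_modal_scan` are all assumptions made for illustration (Mamba-style models use learned, input-dependent parameters on vector-valued tokens). The point is only that a single pass with a carried state costs O(n) in the number of tokens, while still letting each output depend on earlier tokens from the other modality.

```python
def ssm_scan(tokens, a=0.9, b=1.0, c=1.0):
    """Scalar linear recurrence: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t.

    One pass over the sequence, so the cost grows linearly with its length.
    The parameters a, b, c are fixed here (an assumption of this sketch).
    """
    h = 0.0
    outputs = []
    for x in tokens:
        h = a * h + b * x
        outputs.append(c * h)
    return outputs


def cross_modal_scan(rgb_tokens, tir_tokens, **params):
    """Interleave RGB and TIR tokens so the carried state mixes both modalities.

    Interleaving is a hypothetical choice for this sketch, not the paper's design.
    Returns the outputs split back into (rgb_out, tir_out).
    """
    merged = [t for pair in zip(rgb_tokens, tir_tokens) for t in pair]
    ys = ssm_scan(merged, **params)
    return ys[0::2], ys[1::2]


if __name__ == "__main__":
    rgb_out, tir_out = cross_modal_scan([1.0, 0.0], [0.0, 0.0], a=0.5)
    print(rgb_out, tir_out)
```

In this toy setting, a nonzero RGB token influences the following TIR outputs through the state `h`, which is the sense in which a scan can propagate information across modalities without pairwise attention.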

Abstract

RGB-Thermal (RGBT) tracking aims to exploit visible and thermal infrared modalities for robust all-weather object tracking. However, existing RGBT trackers struggle to resolve modality discrepancies, which makes robust feature representation difficult. This limitation hinders effective cross-modal information propagation and fusion and significantly reduces tracking accuracy. To address it, we propose a novel Contextual Aggregation with Deformable Alignment framework, called CADTrack, for RGBT tracking. Specifically, we first deploy the Mamba-based Feature Interaction (MFI), which establishes efficient feature interaction via state space models. This interaction module operates with linear complexity, reducing computational cost and improving feature discrimination. Then, we propose the Contextual Aggregation Module (CAM), which dynamically activates backbone layers through sparse gating based on the Mixture-of-Experts (MoE). This module encodes complementary contextual information from cross-layer features. Finally, we propose the Deformable Alignment Module (DAM), which integrates deformable sampling and temporal propagation to mitigate spatial misalignment and localization drift. With these components, CADTrack achieves robust and accurate tracking in complex scenarios. Extensive experiments on five RGBT tracking benchmarks verify the effectiveness of the proposed method. The source code is released at https://github.com/IdolLab/CADTrack.
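The sparse gating described for CAM can be illustrated with a generic top-k Mixture-of-Experts gate that treats each backbone layer's feature as one "expert" output. This is a minimal sketch under stated assumptions, not the released implementation: the function name `sparse_layer_gate`, the use of plain Python lists as feature vectors, the externally supplied `gate_logits`, and the choice `k=2` are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_layer_gate(layer_feats, gate_logits, k=2):
    """Fuse per-layer features with top-k sparse gating.

    layer_feats: list of L feature vectors (lists of floats, equal length).
    gate_logits: list of L gate scores (in practice these would be predicted
                 from the input; here they are given, an assumption).
    Only the k highest-scoring layers are activated; their weights are a
    softmax over the selected logits, and all other layers get weight zero.
    """
    if len(layer_feats) != len(gate_logits):
        raise ValueError("one gate logit is required per layer")
    if not 1 <= k <= len(layer_feats):
        raise ValueError("k must be between 1 and the number of layers")
    order = sorted(range(len(gate_logits)),
                   key=lambda i: gate_logits[i], reverse=True)
    selected = order[:k]
    weights = softmax([gate_logits[i] for i in selected])
    fused = [0.0] * len(layer_feats[0])
    for w, i in zip(weights, selected):
        for d, value in enumerate(layer_feats[i]):
            fused[d] += w * value
    return fused, dict(zip(selected, weights))


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    fused, active = sparse_layer_gate(feats, [0.0, 5.0, 5.0], k=2)
    print(fused, active)
```

Because only `k` layers receive nonzero weight, the aggregation cost scales with `k` rather than with the total number of layers, which is the usual motivation for sparse rather than dense gating.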

Paper Structure

This paper contains 17 sections, 15 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Comparison with different RGBT tracking paradigms. (a)-(c) The limitations of RGBT tracking include modality discrepancies and spatial misalignment. (d) Existing RGBT trackers exhibit high-complexity feature interaction, use features only from the final layers, and suffer from spatial misalignment. (e) Our framework introduces linear-complexity modality interaction, selects features from multiple layers, and employs deformable alignment.
  • Figure 2: Overall framework of our proposed CADTrack. First, input templates and search regions are tokenized with spatiotemporal alignment cues from previous frames. Then, they are fed into the backbone network with MFI for selective feature interaction. Meanwhile, CAM aggregates multi-level contextual features using modality-specific sparse gating, while DAM generates updated cues through spatial guidance for precise alignment. Finally, a prediction head is used for target localization.
  • Figure 3: Details of our proposed MFI.
  • Figure 4: The structure of our proposed CAM.
  • Figure 5: Deformable alignment of DAM.
  • ...and 3 more figures
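
The deformable alignment named in the Figure 5 caption builds on deformable sampling, which in its generic form reads a feature map at grid positions shifted by per-location offsets, using bilinear interpolation for fractional coordinates. The sketch below shows only that generic operation on a single-channel 2D map. The function names, the zero-padding rule for out-of-range taps, and the hand-written offsets are assumptions for illustration; in a real module the offsets would be predicted by the network, and the paper's DAM additionally involves temporal propagation, which is not modeled here.

```python
import math


def bilinear_sample(fmap, y, x):
    """Bilinearly interpolate a 2D map at fractional position (y, x).

    Taps that fall outside the map contribute zero (zero padding, an
    assumption of this sketch).
    """
    height, width = len(fmap), len(fmap[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= yi < height and 0 <= xi < width:
                weight = (1 - abs(y - yi)) * (1 - abs(x - xi))
                value += weight * fmap[yi][xi]
    return value


def deformable_sample(fmap, offsets):
    """Resample fmap at (i + dy, j + dx) for every output location (i, j).

    offsets[i][j] is a (dy, dx) pair; here the offsets are supplied by hand.
    """
    return [
        [bilinear_sample(fmap, i + dy, j + dx) for j, (dy, dx) in enumerate(row)]
        for i, row in enumerate(offsets)
    ]


if __name__ == "__main__":
    fmap = [[0.0, 1.0], [2.0, 3.0]]
    shift_right = [[(0.0, 1.0), (0.0, 0.0)], [(0.0, 0.0), (0.0, 0.0)]]
    print(deformable_sample(fmap, shift_right))
    print(bilinear_sample(fmap, 0.5, 0.5))
```

With all offsets set to zero the map is returned unchanged, so any correction for spatial misalignment between modalities comes entirely from the offsets.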