Dynamic Disentangled Fusion Network for RGBT Tracking

Chenglong Li, Tao Wang, Zhaodong Ding, Yun Xiao, Jin Tang

TL;DR

This work tackles robust RGBT tracking under diverse and dynamic challenges by introducing Dynamic Disentangled Fusion Network (DDFNet), which disentangles multimodal fusion into six attribute-specific dynamic branches plus a general branch. Each branch is formed from router-guided fusion units (SCFU and SFU) to adaptively fuse RGB and TIR features, while an Adaptive Aggregation Fusion Module selects and weights the active branches and a Lightweight Enhancement Fusion Module strengthens the fused representations. A three-stage training procedure, along with LasHeR-based data generation, enables effective training of the dynamic fusion structure and its components. Empirical results on GTOT, RGBT210, RGBT234, and LasHeR show state-of-the-art performance with notable gains across multiple challenge attributes, illustrating improved robustness and generalization in multimodal tracking. The approach offers a practical pathway to reliable RGBT tracking in real-world scenarios by leveraging adaptive, attribute-aware fusion without heavy reliance on large-scale cross-modal data.

Abstract

RGBT tracking usually suffers from various challenging factors, such as low resolution, similar appearance, extreme illumination, thermal crossover, and occlusion. Existing works often study complex fusion models to handle challenging scenarios, but these models cannot adapt well to the variety of challenges, which might limit tracking performance. To handle this problem, we propose a novel Dynamic Disentangled Fusion Network, called DDFNet, which disentangles the fusion process into several dynamic fusion models via the challenge attributes, so that it adapts to various challenging scenarios for robust RGBT tracking. In particular, we design six attribute-based fusion models to integrate RGB and thermal features under the six challenging scenarios, respectively. Since each fusion model deals with its corresponding challenge, such a disentangled fusion scheme could increase the fusion capacity without depending on large-scale training data. Considering that every challenging scenario also has different levels of difficulty, we propose to optimize the combination of multiple fusion units that forms each attribute-based fusion model in a dynamic manner, which could adapt well to the difficulty of the corresponding challenging scenario. To address the question of which fusion models should be activated during tracking, we design an adaptive aggregation fusion module that integrates all features from the attribute-based fusion models in an adaptive manner, trained with a three-stage training algorithm. In addition, we design an enhancement fusion module to further strengthen the aggregated feature and the modality-specific features. Experimental results on benchmark datasets demonstrate the effectiveness of our DDFNet against other state-of-the-art methods.
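
The adaptive aggregation idea described above can be illustrated with a small sketch. It is not the authors' implementation: the softmax weighting, the `aggregate` function, and the per-branch score inputs are assumptions made for illustration, and only the branch acronyms are taken from the Figure 2 caption. The sketch shows how features from several attribute-based branches might be combined with weights that favor the branches judged most relevant to the current frame.

```python
import math

# Branch acronyms follow the Figure 2 caption; everything else is hypothetical.
BRANCHES = ["IE", "TC", "OCC", "LR", "SA", "GEN"]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(branch_features, branch_scores):
    """Weighted sum of per-branch feature vectors.

    branch_features: dict mapping branch name -> list of floats (equal lengths)
    branch_scores:   dict mapping branch name -> relevance logit (assumed input)
    Returns the fused vector and the weight assigned to each branch.
    """
    names = list(branch_features)
    weights = softmax([branch_scores[n] for n in names])
    dim = len(branch_features[names[0]])
    fused = [0.0] * dim
    for name, weight in zip(names, weights):
        for i, value in enumerate(branch_features[name]):
            fused[i] += weight * value
    return fused, dict(zip(names, weights))


if __name__ == "__main__":
    # Toy two-dimensional features, one vector per branch.
    feats = {name: [float(k), float(k) + 1.0] for k, name in enumerate(BRANCHES)}
    # Pretend the current frame looks mostly like an occlusion case.
    scores = {name: 0.0 for name in BRANCHES}
    scores["OCC"] = 3.0
    fused, weights = aggregate(feats, scores)
    print(weights)
    print(fused)
```

A soft weighting is used here because it keeps every branch's contribution differentiable; a hard selection of active branches would be an equally plausible reading of the text.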

Paper Structure

This paper contains 14 sections, 8 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Comparison of our dynamic disentangled fusion model with existing methods. Common fusion models (a) tend to design a complex single-branch fusion network. Existing attribute-based appearance disentanglement models (b) extract appearance features under certain attributes and then perform feature fusion, but each branch has a fixed structure. In our DDFNet (c), each dynamic fusion branch dynamically selects fusion units to compose the fusion structure according to the challenge scenario, which enables more effective fusion under the corresponding challenge attributes.
  • Figure 2: The proposed dynamic disentangled fusion network. The EFM denotes the lightweight enhancement fusion module. The acronyms IE, TC, OCC, LR, SA, and GEN stand for the dynamic fusion branches based on extreme illumination, thermal crossover, occlusion, low resolution, similar appearance, and general attributes respectively. The detailed structure of the Adaptive Aggregation Fusion Module (AFM) is shown in the network.
  • Figure 3: The dynamic fusion branch is comprised of the Spatial and Channel Fusion Unit (SCFU), Selective Fusion Unit (SFU), and a router. The structures of SCFU and SFU are shown in (a) and (b), and the structure of the router is shown in (c). Herein, SCFU is composed of Spatial Attention Enhancement Module (SAE) and Channel Attention Enhancement Module (CAE), whose detailed designs are shown above.
  • Figure 4: Visualization of the dynamic structure changes of the dynamic fusion branches in challenge scenarios.
  • Figure 5: Feature map visualization of the attribute fusion features in dynamic fusion branches, the aggregated features in the adaptive aggregation fusion module, and the enhanced features in the lightweight enhancement fusion module.
  • ...and 4 more figures
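
The Figure 3 caption describes a dynamic fusion branch built from an SCFU, an SFU, and a router. A minimal sketch of that composition, under stated assumptions, might look as follows. The unit bodies are toy stand-ins (the real SCFU uses spatial and channel attention modules that are not reproduced here), and the router's linear-plus-sigmoid form, the `unit_params` argument, and the `threshold` value are all hypothetical.

```python
import math


def scfu(rgb, tir):
    """Toy stand-in for the Spatial and Channel Fusion Unit: blend the two
    modalities, weighting each by its mean absolute activation."""
    e_rgb = sum(abs(v) for v in rgb) / len(rgb)
    e_tir = sum(abs(v) for v in tir) / len(tir)
    total = (e_rgb + e_tir) or 1.0  # avoid dividing by zero on all-zero inputs
    return [(e_rgb * r + e_tir * t) / total for r, t in zip(rgb, tir)]


def sfu(rgb, tir):
    """Toy stand-in for the Selective Fusion Unit: per element, keep the
    modality with the larger magnitude."""
    return [r if abs(r) >= abs(t) else t for r, t in zip(rgb, tir)]


UNITS = {"SCFU": scfu, "SFU": sfu}


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def router(rgb, tir, unit_params):
    """Hypothetical router: one gate in (0, 1) per fusion unit, computed from
    the mean RGB and TIR activations with per-unit (w_rgb, w_tir, bias)."""
    m_rgb = sum(rgb) / len(rgb)
    m_tir = sum(tir) / len(tir)
    return {
        name: sigmoid(w_rgb * m_rgb + w_tir * m_tir + bias)
        for name, (w_rgb, w_tir, bias) in unit_params.items()
    }


def dynamic_branch(rgb, tir, unit_params, threshold=0.5):
    """Run only the units whose gate passes the threshold and return the
    gate-weighted average of their outputs plus the list of active units."""
    gates = router(rgb, tir, unit_params)
    active = [name for name in UNITS if gates[name] >= threshold]
    if not active:
        # Fall back to the best-scoring unit so the branch always outputs.
        active = [max(gates, key=gates.get)]
    outputs = {name: UNITS[name](rgb, tir) for name in active}
    total = sum(gates[name] for name in active)
    fused = [
        sum(gates[name] * outputs[name][i] for name in active) / total
        for i in range(len(rgb))
    ]
    return fused, active


if __name__ == "__main__":
    rgb, tir = [1.0, -2.0], [3.0, 0.5]
    params = {"SCFU": (0.0, 0.0, 4.0), "SFU": (0.0, 0.0, -4.0)}
    print(dynamic_branch(rgb, tir, params))
```

Thresholding the gates is one simple way to make the branch structure change with the input; the selection mechanism actually used in the paper may differ.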