AFter: Attention-based Fusion Router for RGBT Tracking

Andong Lu, Wanyu Wang, Chenglong Li, Jin Tang, Bin Luo

TL;DR

The paper tackles the rigidity of fixed RGB-thermal fusion structures in RGBT tracking by introducing AFter, a dynamic fusion router that uses a hierarchical attention network (HAN) to adaptively select a fusion structure for each frame. The HAN, composed of multiple fusion units and routers across three layers, supports self-, uni-directional, and bi-directional cross-modal fusion, which expands the fusion structure space. Experiments on five mainstream datasets show state-of-the-art performance and robustness under diverse challenges, and the code is released for reproducibility. The approach provides a practical, adaptable framework for multimodal tracking, with potential for broader multimodal fusion tasks.

Abstract

Multi-modal feature fusion is a core component of RGBT tracking and has motivated numerous fusion studies in recent years. However, existing RGBT tracking methods widely adopt fixed fusion structures to integrate multi-modal features, and these structures struggle to handle the varied challenges of dynamic scenarios. To address this problem, this work presents a novel Attention-based Fusion router, called AFter, which optimizes the fusion structure to adapt to dynamic, challenging scenarios for robust RGBT tracking. In particular, we design a fusion structure space based on a hierarchical attention network, in which each attention-based fusion unit corresponds to a fusion operation and a combination of these units corresponds to a fusion structure. By optimizing the combination of attention-based fusion units, we can dynamically select the fusion structure to adapt to various challenging scenarios. Unlike the complex search over different structures in neural architecture search algorithms, we develop a dynamic routing algorithm that equips each attention-based fusion unit with a router to predict the combination weights, enabling efficient optimization of the fusion structure. Extensive experiments on five mainstream RGBT tracking datasets demonstrate the superior performance of the proposed AFter against state-of-the-art RGBT trackers. We release the code at https://github.com/Alexadlu/AFter.
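
The routing idea in the abstract (fusion units whose outputs are combined using router-predicted weights) can be illustrated with a minimal sketch. This is not the authors' implementation: the three toy units, the `gated` helper, the single-layer layout, and the `ROUTER_PARAMS` values are all hypothetical stand-ins chosen for illustration, and features are plain Python lists rather than feature maps.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated(query, context):
    """Toy stand-in for attention: scale `context` by a sigmoid of its
    scaled dot-product similarity with `query`."""
    sim = sum(q * c for q, c in zip(query, context)) / math.sqrt(len(query))
    gate = 1.0 / (1.0 + math.exp(-sim))
    return [gate * c for c in context]


# Hypothetical fusion units: each maps (rgb, tir) features to one fused feature.
def identity_unit(rgb, tir):
    return list(rgb)            # pass the RGB feature through unchanged


def self_unit(rgb, tir):
    return gated(rgb, rgb)      # self-enhancement of the RGB feature


def cross_unit(rgb, tir):
    return gated(rgb, tir)      # thermal information gated by the RGB query


UNITS = [identity_unit, self_unit, cross_unit]

# Hypothetical router parameters, one (w_rgb, w_tir, bias) triple per unit.
# In a trained model these would be learned; here they are fixed by hand.
ROUTER_PARAMS = [(1.0, 0.0, 0.0), (0.5, 0.5, 0.0), (0.0, 1.0, 0.0)]


def router(rgb, tir, params):
    """Predict one combination weight per unit from mean-pooled inputs."""
    rgb_mean = sum(rgb) / len(rgb)
    tir_mean = sum(tir) / len(tir)
    logits = [w_r * rgb_mean + w_t * tir_mean + b for (w_r, w_t, b) in params]
    return softmax(logits)


def fuse(rgb, tir, params):
    """Combine all unit outputs using the router's input-dependent weights."""
    weights = router(rgb, tir, params)
    outputs = [unit(rgb, tir) for unit in UNITS]
    fused = [
        sum(w * out[i] for w, out in zip(weights, outputs))
        for i in range(len(rgb))
    ]
    return fused, weights


# The same units yield different effective structures for different inputs.
fused_a, weights_a = fuse([0.2, 0.4, 0.1], [0.9, 0.8, 0.7], ROUTER_PARAMS)
fused_b, weights_b = fuse([0.9, 0.8, 0.7], [0.1, 0.0, 0.2], ROUTER_PARAMS)
```

Because the weights depend on the input, the effective fusion structure changes per frame without any architecture search. The sketch uses a single layer with one router for brevity, whereas the paper attaches a router to each fusion unit across several layers.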

Paper Structure (16 sections, 7 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Comparison of existing RGBT tracking models. (a), (b) and (c) show the representation model, feature fusion model and decision fusion model, respectively. (d) shows the proposed attention-based fusion router. (e) compares the precision (PR) scores of the proposed method and state-of-the-art RGBT tracking methods across five RGBT tracking datasets.
  • Figure 2: Overall network architecture of AFter, with an illustration of each fusion unit and router. ECA and SGE denote channel attention [wang2020eca] and spatial attention [li2019spatial], respectively.
  • Figure 3: Precision Rate (PR) and Success Rate (SR) of challenge attributes on the RGBT234 dataset.
  • Figure 4: Visualization of fusion structures in HAN under different scenarios.
  • Figure 5: Comparison of feature maps between ToMP$_{RGBT}$ (third column) and HAN-based ToMP$_{RGBT}$ (fourth column).
  • ...and 2 more figures