X Modality Assisting RGBT Object Tracking

Zhaisheng Ding; Haiyan Li; Ruichao Hou; Yanyu Liu; Shidong Xie

TL;DR

This work tackles robust RGBT tracking by reducing cross-modal discrepancies with X-Net, a three-level fusion framework. It introduces a pixel-level generation module (PGM) that uses self-knowledge distillation to synthesize an X modality; a feature-level interaction module (FIM) that combines a spatial-dimensional feature translation strategy with a mixed feature interaction transformer; and a decision-level refinement module (DRM) that adaptively applies optical flow or a refinement network for precise re-localization. Experiments on the GTOT, RGBT234, and LasHeR benchmarks report average gains of $0.47\%$ in precision rate (PR) and $1.2\%$ in success rate (SR), at a competitive speed of ≈$21$ fps and with favorable complexity. Overall, X-Net advances multi-modal tracking by fusing heterogeneous cues and refining predictions online; the authors pledge to release the code at https://github.com/DZSYUNNAN/XNet.

Abstract

Developing robust multi-modal feature representations is crucial for enhancing object tracking performance. In pursuit of this objective, a novel X Modality Assisting Network (X-Net) is introduced, which explores the impact of the fusion paradigm by decoupling visual object tracking into three distinct levels, thereby facilitating subsequent processing. First, to overcome the difficulty of feature learning caused by significant discrepancies between the RGB and thermal modalities, a plug-and-play pixel-level generation module (PGM) based on knowledge distillation learning is proposed. This module generates the X modality, bridging the gap between the two modalities while minimizing noise interference. Second, to optimize sample feature representation and promote cross-modal interactions, a feature-level interaction module (FIM) is introduced, integrating a mixed feature interaction transformer and a spatial-dimensional feature translation strategy. Finally, to address random drifting caused by missing instance features, a flexible online optimization strategy called the decision-level refinement module (DRM) is proposed, which incorporates optical flow and refinement mechanisms. The efficacy of X-Net is validated through experiments on three benchmarks, demonstrating its superiority over state-of-the-art trackers. Notably, X-Net achieves average gains of 0.47% and 1.2% in precision rate and success rate, respectively. Additionally, the research content, data, and code are pledged to be made publicly accessible at https://github.com/DZSYUNNAN/XNet.
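The two metrics quoted in the abstract can be illustrated with a minimal sketch. It follows the definitions commonly used in RGBT tracking benchmarks: precision rate (PR) is the fraction of frames whose center location error falls within a pixel threshold, and success rate (SR) is the area under the curve of frames whose overlap exceeds a varying IoU threshold. This is not the authors' evaluation code; the function names, the `(x, y, width, height)` box layout, the default 20-pixel threshold, and the 21-step approximation of the area under the curve are all assumptions made for illustration.

```python
from typing import Sequence, Tuple

# Assumed box layout: (x, y, width, height) with (x, y) the top-left corner.
Box = Tuple[float, float, float, float]


def center_error(a: Box, b: Box) -> float:
    """Euclidean distance between the centers of two boxes, in pixels."""
    ax, ay = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx, by = b[0] + b[2] / 2, b[1] + b[3] / 2
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def precision_rate(preds: Sequence[Box], gts: Sequence[Box],
                   dist_threshold: float = 20.0) -> float:
    """Fraction of frames whose center error is within dist_threshold.

    The 20-pixel default is an assumed convention; benchmarks with small
    targets often use a tighter threshold.
    """
    hits = sum(center_error(p, g) <= dist_threshold for p, g in zip(preds, gts))
    return hits / len(gts)


def success_rate(preds: Sequence[Box], gts: Sequence[Box],
                 steps: int = 21) -> float:
    """Approximate area under the success curve.

    For each IoU threshold in [0, 1], compute the fraction of frames whose
    overlap exceeds it, then average over the thresholds.
    """
    overlaps = [iou(p, g) for p, g in zip(preds, gts)]
    thresholds = [i / (steps - 1) for i in range(steps)]
    curve = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return sum(curve) / len(curve)
```

Under these definitions, a reported PR/SR gain is a difference in these fractions (expressed in percent) averaged over the evaluated sequences.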

Paper Structure (18 sections, 6 equations, 12 figures, 7 tables, 1 algorithm)

Introduction
Related Work
Deep Learning-based RGBT Tracking Methods
RGBT information fusion based on knowledge distillation
Refinement Mechanism for RGBT Tracking
Methodology
Overview
X Modality Assisting Network
Network Training
Online Tracking
Experimental Results
Datasets and Metrics
Evaluation on GTOT Dataset
Evaluation on RGBT234 Dataset
Evaluation on LasHeR Dataset
...and 3 more sections

Figures (12)

Figure 1: Comparison with existing RT-MDNet-based trackers. ${D_{rgb}},{D_t},{D_x}$ denote the deep features of the RGB, thermal, and X modalities.
Figure 2: Overview of the proposed X-Net framework.
Figure 3: The details of PGM. $N$ denotes the noise interference.
Figure 4: Illustration of the effectiveness of feature attention in FIM. (a) RGB images, (b) thermal images, and feature heatmaps (c) without FIM and (d) with FIM.
Figure 5: Comparison of X-Net against state-of-the-art trackers on GTOT. Challenge-based performance is reported as PR/SR scores ($\%$).
...and 7 more figures
