Infrared UAV Target Tracking with Dynamic Feature Refinement and Global Contextual Attention Knowledge Distillation

Houzhang Fang, Chenxing Wu, Kun Bai, Tianqi Chen, Xiaolin Wang, Xiyang Liu, Yi Chang, Luxin Yan

TL;DR

The paper tackles infrared UAV target tracking under challenging conditions by introducing SiamDFF, a Siamese tracker augmented with selective target enhancement, dual-branch spatial feature aggregation, and context-aware template fusion. It adds a tracking-specific knowledge distiller (TCAKD) to transfer global contextual attention from a teacher to a lightweight student backbone, boosting feature extraction without sacrificing speed. Empirical results on real infrared UAV datasets show state-of-the-art performance across multiple metrics and strong real-time capabilities, with thorough ablations validating each component’s contribution. The work advances IRUT tracking by combining targeted cross-attention refinement, local-global feature fusion, and distillation of global context, yielding robust performance in cluttered, multi-scale scenarios with efficient inference.

Abstract

Unmanned aerial vehicle (UAV) target tracking based on thermal infrared imaging is one of the most important sensing technologies in anti-UAV applications. However, infrared UAV targets often exhibit weak features and appear against complex backgrounds, posing significant challenges to accurate tracking. To address these problems, we introduce SiamDFF, a novel dynamic feature fusion Siamese network that integrates feature enhancement and global contextual attention knowledge distillation for infrared UAV target (IRUT) tracking. SiamDFF incorporates a selective target enhancement network (STEN), a dynamic spatial feature aggregation module (DSFAM), and a dynamic channel feature aggregation module (DCFAM). The STEN employs intensity-aware multi-head cross-attention to adaptively enhance important regions in both the template and search branches. The DSFAM enhances multi-scale UAV target features by integrating local details with global features under spatial attention guidance within the search frame. The DCFAM fuses the mixed template generated by the STEN in the template branch with the original template, which limits background interference in the template and strengthens the emphasis on UAV target region features within the search frame. Furthermore, to enhance the feature extraction capabilities of the network for IRUT without adding extra computational burden, we propose a novel tracking-specific target-aware contextual attention knowledge distiller (TCAKD). It transfers the target prior from the teacher network to the student model, significantly improving the student network's focus on informative regions at each hierarchical level of the backbone network. Extensive experiments on real infrared UAV datasets demonstrate that the proposed approach outperforms state-of-the-art target trackers under complex backgrounds while achieving real-time tracking speed.
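The abstract describes the distiller only at a high level, so the following is a rough illustration of the general idea of attention-based feature distillation, not the paper's TCAKD formulation. It assumes a generic attention-transfer loss: each feature map is collapsed into a spatial attention map, and the student is penalized when its map differs from the teacher's. The function names, the squared-activation energy, and the mean squared error are all illustrative assumptions.

```python
import math


def spatial_attention(feat):
    """Collapse a C x H x W feature map (nested lists) into an H x W map.

    Assumption for illustration: energy is the mean squared activation
    across channels, followed by a softmax over all spatial positions.
    """
    channels = len(feat)
    height, width = len(feat[0]), len(feat[0][0])
    energy = [
        sum(feat[c][y][x] ** 2 for c in range(channels)) / channels
        for y in range(height)
        for x in range(width)
    ]
    peak = max(energy)  # subtract the max for numerical stability
    exps = [math.exp(e - peak) for e in energy]
    total = sum(exps)
    flat = [e / total for e in exps]
    return [flat[y * width:(y + 1) * width] for y in range(height)]


def attention_distillation_loss(student_feat, teacher_feat):
    """Mean squared error between student and teacher attention maps."""
    s_map = spatial_attention(student_feat)
    t_map = spatial_attention(teacher_feat)
    diffs = [
        (s - t) ** 2
        for s_row, t_row in zip(s_map, t_map)
        for s, t in zip(s_row, t_row)
    ]
    return sum(diffs) / len(diffs)


# Toy data: the teacher has two channels and the student has one. The channel
# counts may differ because the attention maps depend only on spatial size.
teacher = [[[0.0, 0.0], [0.0, 3.0]], [[0.0, 0.0], [0.0, 3.0]]]
student_good = [[[0.0, 0.0], [0.0, 2.0]]]  # responds at the same location
student_bad = [[[2.0, 0.0], [0.0, 0.0]]]   # responds at the wrong location

loss_good = attention_distillation_loss(student_good, teacher)
loss_bad = attention_distillation_loss(student_bad, teacher)
print(loss_good < loss_bad)
```

In this toy setup the student whose response coincides with the teacher's hot spot receives the lower loss. Because such a loss is used only during training, it adds no cost at inference time, which is consistent with the abstract's claim of improving feature extraction without extra computational burden.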

Paper Structure

This paper contains 34 sections, 14 equations, 15 figures, 15 tables.

Figures (15)

  • Figure 1: Qualitative comparison of our method with the baseline SiamYOLO and MixFormer on two challenging video sequences (complex background and small target). The proposed Siamese tracker demonstrates superior performance in infrared UAV target tracking across various complex backgrounds due to its dynamic feature enhancement and global contextual attention-based knowledge distillation.
  • Figure 1: Representative images of the ten sequences selected as the dataset for training and testing. Each row presents six images from one sequence. The targets are marked with green bounding boxes, and close-ups are given in the bottom-left or bottom-right corner of each image.
  • Figure 2: Overview of the proposed SiamDFF. The feature interaction module, which includes the proposed STEN, DSFAM, and DCFAM, enhances search frame features to improve the model's target classification and localization. ROI denotes region of interest.
  • Figure 2: Qualitative tracking results of five methods on five challenging sequences (Seq. 1-3, Seq. 6, and Seq. 8). Each row presents one sequence with five representative images. Close-ups are given in the bottom-left corner for better visualization.
  • Figure 3: (a) Structure of the selective target enhancement network (STEN). (b) Structure of the intensity-aware multi-head cross-attention (IMC). $\raisebox{0.1pt}{\textcircled{\tiny M}}$ denotes matrix multiplication.
  • ...and 10 more figures