UAUTrack: Towards Unified Multimodal Anti-UAV Visual Tracking

Qionglin Ren, Dawei Zhang, Chunxu Tian, Dan Zhang

TL;DR

UAUTrack introduces a unified, single-stream multimodal framework for Anti-UAV tracking that handles RGB, TIR, and RGB-T inputs with one model. It uses a Text Prior Prompt (TPP) to guide tracking semantically without external detectors, and fuses modalities in a transformer-based backbone with cross-modal attention. The approach reaches state-of-the-art results on the Anti-UAV and DUT Anti-UAV benchmarks and a favourable accuracy-speed trade-off on Anti-UAV410, while maintaining real-time efficiency, aided by online template updating and end-to-end fine-tuning. Ablation studies support the contributions of full fine-tuning, TPP, and multimodal fusion. Overall, the work shows that unified multimodal Anti-UAV tracking is practical and robust across varied operational scenarios.

Abstract

Research in Anti-UAV (Unmanned Aerial Vehicle) tracking has explored various modalities, including RGB, TIR, and RGB-T fusion. However, a unified framework for cross-modal collaboration is still lacking. Existing approaches have primarily focused on independent models for individual tasks, often overlooking the potential for cross-modal information sharing. Furthermore, Anti-UAV tracking techniques are still in their infancy, with current solutions struggling to achieve effective multimodal data fusion. To address these challenges, we propose UAUTrack, a unified single-target tracking framework built upon a single-stream, single-stage, end-to-end architecture that effectively integrates multiple modalities. UAUTrack introduces a key component: a text prior prompt strategy that directs the model to focus on UAVs across various scenarios. Experimental results show that UAUTrack achieves state-of-the-art performance on the Anti-UAV and DUT Anti-UAV datasets, and maintains a favourable trade-off between accuracy and speed on the Anti-UAV410 dataset, demonstrating both high accuracy and practical efficiency across diverse Anti-UAV scenarios.

Paper Structure

This paper contains 12 sections, 8 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Differences between our Anti-UAV tracking approach and previous methods. (a) Siamese-based trackers. (b) Detection-based trackers. (c) Hybrid trackers combining detection and Siamese networks. (d) Our unified end-to-end one-stream method. Any-Modality refers to the RGB, TIR, and RGB-T modalities.
  • Figure 2: Overall architecture of UAUTrack. A unified token embedding represents the different modalities, including visible, thermal, and RGB-Thermal inputs. Text prior prompt (TPP) tokens are generated and fed into the transformer encoder, where they are fused with the search and template tokens from each modality.
  • Figure 3: Qualitative visualization of common challenges in Anti-UAV tracking, including (a) TIR Thermal Crossover (TC), (b) RGB TC, (c) TIR Scale Variation (SV), (d) TIR Low Resolution (LR), (e) RGB LR, (f) TIR Fast Motion (FM).
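
The single-stream idea in the Figure 2 caption can be illustrated with a small sketch: prompt, template, and search tokens from whichever modalities are present are joined into one sequence, and one attention pass lets every token attend to every other. This is not the authors' implementation. The names `DIM`, `fake_tokens`, `build_sequence`, and `self_attention` are hypothetical, the random vectors stand in for real patch and text embeddings, and the attention layer has no learned weights. The token counts are also arbitrary.

```python
import math
import random

DIM = 8  # hypothetical embedding width, chosen only for this sketch


def fake_tokens(count, seed):
    """Stand-in for a real patch/text embedder: `count` random DIM-d vectors."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(count)]


def build_sequence(prompt, inputs):
    """Join prompt tokens with template/search tokens of every modality present.

    `inputs` maps a modality name ("rgb", "tir") to a (template, search) pair.
    Returns the joint sequence and a dict of named (start, end) spans into it.
    """
    sequence, spans = list(prompt), {"prompt": (0, len(prompt))}
    for modality in sorted(inputs):
        template, search = inputs[modality]
        for role, tokens in (("template", template), ("search", search)):
            start = len(sequence)
            sequence.extend(tokens)
            spans[f"{modality}_{role}"] = (start, len(sequence))
    return sequence, spans


def self_attention(sequence):
    """One parameter-free softmax self-attention pass over the joint sequence."""
    scale = 1.0 / math.sqrt(DIM)
    output = []
    for query in sequence:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in sequence]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        output.append([
            sum(w / total * value[d] for w, value in zip(weights, sequence))
            for d in range(DIM)
        ])
    return output


prompt = fake_tokens(2, seed=0)                       # placeholder text-prompt tokens
rgb = (fake_tokens(4, seed=1), fake_tokens(9, seed=2))  # (template, search)
tir = (fake_tokens(4, seed=3), fake_tokens(9, seed=4))

# The same two functions serve RGB-only, TIR-only, and RGB-T inputs.
for name, inputs in (("RGB", {"rgb": rgb}),
                     ("TIR", {"tir": tir}),
                     ("RGB-T", {"rgb": rgb, "tir": tir})):
    sequence, spans = build_sequence(prompt, inputs)
    fused = self_attention(sequence)
    print(name, len(fused), spans)
```

The point of the sketch is that only the sequence length changes between modalities; the processing path stays the same, which is what distinguishes a one-stream design from separate per-modality branches.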