How Far are Modern Trackers from UAV-Anti-UAV? A Million-Scale Benchmark and New Baseline

Chunhui Zhang, Li Liu, Zhipeng Zhang, Yong Wang, Hao Wen, Xi Zhou, Shiming Ge, Yanfeng Wang

TL;DR

The paper defines UAV-Anti-UAV tracking, a new air-to-air visual task where a pursuer UAV tracks an adversarial target under dual dynamics. It introduces a million-scale benchmark (1,810 videos, ~1.05M frames) with bounding boxes, language prompts, and 15 attributes, and proposes MambaSTS to fuse spatial, temporal, and semantic cues via a unidirectional state-space module. Across 50 trackers, the study reveals substantial gaps between current methods and the demands of real aerial confrontation, while showing MambaSTS achieves state-of-the-art performance and generalizes to related UAV and Anti-UAV benchmarks. The work provides a practical, multi-modal testbed and a strong baseline to drive future progress in robust, real-time anti-UAV tracking systems.

Abstract

Unmanned Aerial Vehicles (UAVs) offer wide-ranging applications but also pose significant safety and privacy violation risks in areas like airport and infrastructure inspection, spurring the rapid development of Anti-UAV technologies in recent years. However, current Anti-UAV research primarily focuses on RGB, infrared (IR), or RGB-IR videos captured by fixed ground cameras, with little attention to tracking target UAVs from another moving UAV platform. To fill this gap, we propose a new multi-modal visual tracking task termed UAV-Anti-UAV, which involves a pursuer UAV tracking a target adversarial UAV in the video stream. Compared to existing Anti-UAV tasks, UAV-Anti-UAV is more challenging due to severe dual-dynamic disturbances caused by the rapid motion of both the capturing platform and the target. To advance research in this domain, we construct a million-scale dataset consisting of 1,810 videos, each manually annotated with bounding boxes, a language prompt, and 15 tracking attributes. Furthermore, we propose MambaSTS, a Mamba-based baseline method for UAV-Anti-UAV tracking, which enables integrated spatial-temporal-semantic learning. Specifically, we employ Mamba and Transformer models to learn global semantic and spatial features, respectively, and leverage the state space model's strength in long-sequence modeling to establish video-level long-term context via a temporal token propagation mechanism. We conduct experiments on the UAV-Anti-UAV dataset to validate the effectiveness of our method. A thorough experimental evaluation of 50 modern deep tracking algorithms demonstrates that there is still significant room for improvement in the UAV-Anti-UAV domain. The dataset and codes will be available at https://github.com/983632847/Awesome-Multimodal-Object-Tracking.
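
The abstract does not spell out how the temporal token propagation mechanism works, so the following is only a rough illustration of the general idea of a unidirectional state-space scan, not the MambaSTS implementation. It assumes a fixed scalar recurrence h_t = decay * h_{t-1} + gain * x_t over per-frame feature vectors; the function name `propagate_temporal_token` and the parameters `decay` and `gain` are hypothetical, and a real Mamba block uses learned, input-dependent parameters instead.

```python
from typing import List


def propagate_temporal_token(
    frame_features: List[List[float]],
    decay: float = 0.9,   # hypothetical: how much of the previous state is kept
    gain: float = 0.1,    # hypothetical: how much of the current frame is mixed in
) -> List[List[float]]:
    """Run a causal linear state-space scan over a sequence of frame features.

    Implements h_t = decay * h_{t-1} + gain * x_t with h_{-1} = 0, and returns
    the state after every frame. The state at frame t depends only on frames
    0..t, which is what makes the scan unidirectional.
    """
    if not frame_features:
        return []
    dim = len(frame_features[0])
    state = [0.0] * dim
    states: List[List[float]] = []
    for features in frame_features:
        if len(features) != dim:
            raise ValueError("all frame features must have the same dimension")
        # Carry the previous state forward and blend in the current frame.
        state = [decay * h + gain * x for h, x in zip(state, features)]
        states.append(state)
    return states


if __name__ == "__main__":
    frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
    for t, token in enumerate(propagate_temporal_token(frames, decay=0.5, gain=1.0)):
        print(t, token)
```

Because the state is updated once per frame, the cost grows linearly with video length, which is the property usually cited when state-space models are preferred for long sequences.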

Paper Structure

This paper contains 27 sections, 9 equations, 15 figures, and 5 tables.

Figures (15)

  • Figure 1: Comparison of three distinct UAV-related visual tracking tasks. (a) UAV Tracking: A UAV tracks ground targets (e.g., cars, pedestrians), characterized by top-down views and scale variations. (b) Anti-UAV: A ground-based camera tracks an airborne UAV, often facing cluttered sky backgrounds and tiny targets. (c) Proposed UAV-Anti-UAV: A chasing UAV tracks a target UAV. This task involves highly dynamic relative motion and erratic background changes due to the rapid movement of both the platform and the target.
  • Figure 2: Representative examples from the UAV-Anti-UAV benchmark dataset. The dataset contains five distinct categories of target UAVs: fixed-wing, first-person view (FPV), multi-rotor, vertical take-off and landing (VTOL), and unmanned helicopter. Each example is annotated with bounding boxes and a corresponding language prompt describing the target and its environment.
  • Figure 3: Distribution of videos for each tracking attribute.
  • Figure 4: Distribution of brightness values for the proposed dataset versus existing benchmarks. The average brightness of each dataset is provided in the legend.
  • Figure 5: Comparison of relative speed distributions between the proposed UAV-Anti-UAV dataset and existing UAV tracking and Anti-UAV datasets. The legend displays the average relative speed of each dataset.
  • ...and 10 more figures
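
The Figure 4 caption refers to per-dataset average brightness, but this listing does not say how brightness is measured. As a minimal sketch under stated assumptions, the code below treats brightness as the mean Rec. 601 luma of 8-bit RGB pixels and averages it first per frame and then across frames; the weights, the helper names `frame_brightness` and `dataset_brightness`, and the pixel representation are all assumptions for illustration, not the paper's procedure.

```python
from typing import Sequence, Tuple

Pixel = Tuple[int, int, int]  # assumed 8-bit (R, G, B) values


def frame_brightness(pixels: Sequence[Pixel]) -> float:
    """Return the mean luma of one frame, using Rec. 601 weights (an assumption)."""
    if not pixels:
        raise ValueError("frame has no pixels")
    total = sum(0.299 * r + 0.587 * g + 0.114 * b for r, g, b in pixels)
    return total / len(pixels)


def dataset_brightness(frames: Sequence[Sequence[Pixel]]) -> float:
    """Return the average of per-frame brightness over all frames."""
    if not frames:
        raise ValueError("dataset has no frames")
    values = [frame_brightness(frame) for frame in frames]
    return sum(values) / len(values)


if __name__ == "__main__":
    black = [(0, 0, 0)] * 4
    white = [(255, 255, 255)] * 4
    print(dataset_brightness([black, white]))
```

Averaging per frame first gives every frame equal weight regardless of resolution; pooling all pixels instead would weight high-resolution videos more heavily, so the choice matters when comparing datasets.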