A Tri-Modal Dataset and a Baseline System for Tracking Unmanned Aerial Vehicles

Tianyang Xu, Jinjie Gu, Xuefeng Zhu, XiaoJun Wu, Josef Kittler

TL;DR

This work tackles the fragility of vision-based UAV tracking under challenging conditions by introducing MM-UAV, the first large-scale tri-modal UAV tracking benchmark (RGB, IR, and event) with 1,321 sequences and about 2.8 million frames per modality. It also presents a baseline multi-modal tracker, MMA-SORT, which combines an Offset-Guided Adaptive Alignment (OGAA) module, an Adaptive Dynamic Fusion Module (ADFM), and an event-driven motion embedding to improve identity maintenance in multi-UAV scenarios. The dataset provides rigorous annotations (including independent RGB/IR labels and seven challenging attributes) and extensive statistics to support robust evaluation, while MMA-SORT demonstrates significant performance gains over state-of-the-art unimodal and multi-object trackers, especially in low-light and fast-motion cases. The work offers a practical foundation for future research in multi-modal UAV tracking and enables broader exploration of cross-modal fusion and motion-aware association in autonomous anti-UAV systems.

Abstract

With the proliferation of low-altitude unmanned aerial vehicles (UAVs), visual multi-object tracking is becoming a critical security technology, demanding significant robustness even in complex environmental conditions. However, tracking UAVs using a single visual modality often fails in challenging scenarios, such as low illumination, cluttered backgrounds, and rapid motion. Although multi-modal multi-object UAV tracking is more resilient, the development of effective solutions has been hindered by the absence of dedicated public datasets. To bridge this gap, we release MM-UAV, the first large-scale benchmark for Multi-Modal UAV Tracking, integrating three key sensing modalities, namely RGB, infrared (IR), and event signals. The dataset spans over 30 challenging scenarios, with 1,321 synchronised multi-modal sequences and more than 2.8 million annotated frames. Accompanying the dataset, we provide a novel multi-modal multi-UAV tracking framework, designed specifically for UAV tracking applications and serving as a baseline for future research. Our framework incorporates two key technical innovations, namely an offset-guided adaptive alignment module to resolve spatial mismatches across sensors, and an adaptive dynamic fusion module to balance complementary information conveyed by different modalities. Furthermore, to overcome the limitations of conventional appearance modelling in multi-object tracking, we introduce an event-enhanced association mechanism that leverages motion cues from the event modality for more reliable identity maintenance. Comprehensive experiments demonstrate that the proposed framework consistently outperforms state-of-the-art methods. To foster further research in multi-modal UAV tracking, both the dataset and source code will be made publicly available at https://xuefeng-zhu5.github.io/MM-UAV/.

Paper Structure

This paper contains 21 sections, 12 equations, 14 figures, 4 tables.

Figures (14)

  • Figure 1: An example of the released MM-UAV dataset. The respective imaging characteristics of different sensors provide modal complementarity that is critical for robust operation in different scenarios. Different modalities exhibit varying degrees of visual misalignment. For instance, in low-light scenarios, the IR (infrared) modality can provide more discriminative clues.
  • Figure 2: Temporal distribution. We analyse the temporal distribution of sequences in the training set and in the test set separately. The figure details the number of sequences corresponding to each time interval. Notably, all modalities in this dataset share identical sequence lengths, so we do not distinguish between modalities.
  • Figure 3: Spatial distribution. The spatial distributions of the targets in the training and test sets are examined and visualised using scatter plots and heatmaps. Since the two modalities may exhibit discrepancies in the target distributions due to differences in visibility and other factors, the spatial distributions are computed and presented separately for each modality.
  • Figure 4: Size distribution. Although UAVs, as rigid bodies, do not undergo non-rigid deformation, the complexity of their flight postures leads to substantial variations in the aspect ratios of their bounding boxes.
  • Figure 5: The trajectory statistics. We provide the trajectory-level statistics across the entire dataset, including the number of target trajectories per sequence, the proportion of visible duration for each trajectory, and the frequency with which trajectories disappear and reappear.
  • ...and 9 more figures
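
The trajectory-level statistics named in the Figure 5 caption (proportion of visible duration, and how often a trajectory disappears and reappears) can be illustrated with a short sketch. It assumes each trajectory is given as a list of per-frame visibility flags; that layout, the function name `trajectory_stats`, and the returned keys are hypothetical choices made for illustration and are not the dataset's actual annotation format or the authors' tooling.

```python
from itertools import groupby


def trajectory_stats(visible):
    """Summarise one trajectory from per-frame visibility flags.

    `visible` is an assumed input layout: one truthy/falsy flag per frame,
    truthy when the target is visible in that frame.
    """
    if not visible:
        raise ValueError("trajectory must contain at least one frame")

    flags = [bool(v) for v in visible]

    # Fraction of frames in which the target is visible.
    visible_ratio = sum(flags) / len(flags)

    # Collapse consecutive identical flags into runs, e.g.
    # [T, T, F, F, T] -> [T, F, T].
    runs = [flag for flag, _ in groupby(flags)]

    # A disappearance is a visible run followed by an invisible run;
    # a reappearance is the opposite transition.
    transitions = list(zip(runs, runs[1:]))
    disappearances = sum(1 for a, b in transitions if a and not b)
    reappearances = sum(1 for a, b in transitions if not a and b)

    return {
        "visible_ratio": visible_ratio,
        "disappearances": disappearances,
        "reappearances": reappearances,
    }


if __name__ == "__main__":
    # Toy trajectory: visible, lost for two frames, seen once, lost, seen again.
    print(trajectory_stats([1, 1, 0, 0, 1, 0, 1, 1]))
```

Counting transitions between runs rather than individual invisible frames means that a long occlusion counts as a single disappearance, which matches the caption's notion of frequency rather than duration.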