MUST: The First Dataset and Unified Framework for Multispectral UAV Single Object Tracking

Haolin Qin, Tingfa Xu, Tianhao Li, Zhenxiang Chen, Tao Feng, Jianan Li

TL;DR

This work tackles the limitations of RGB UAV tracking under challenging conditions by introducing MUST, the first large-scale multispectral UAV tracking dataset (250 sequences, 43k frames, 8 spectral bands at 1200×900) with 12 challenge attributes. It also presents UNTrack, a Unified Spectral-Spatial-Temporal Tracker built on a Unified Asymmetric Transformer, a spectral background elimination mechanism, and a Spectrum Prompt Encoder, all enhanced by an MSI parameter reconstruction strategy for initialization. UNTrack integrates spectral, spatial, and temporal cues through asymmetric attention that prunes nonessential interactions, updates a spectrum prompt across frames, and outputs precise bounding boxes via a dual-branch head trained with $\mathcal{L}=\mathcal{L}_{cls}+\lambda_1\mathcal{L}_1+\lambda_2\mathcal{L}_{GIoU}$. Empirical results show UNTrack achieves state-of-the-art performance on MUST, plus favorable efficiency and versatility on MSI-based and RGB-based tracking benchmarks, signaling strong potential for real-world multispectral UAV tracking applications.
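The training objective above adds two box-regression terms to a classification term. As a minimal sketch (not the authors' implementation), the snippet below evaluates the combined value for a single predicted box in plain Python. The function names, the corner-format `(x1, y1, x2, y2)` boxes, the mean-absolute-error form of $\mathcal{L}_1$, the use of $1-\mathrm{GIoU}$ as $\mathcal{L}_{GIoU}$, and the default weights of 1.0 are all assumptions made for illustration. The classification loss is passed in as a precomputed number because its exact form is not stated here.

```python
def giou(box_a, box_b):
    """Generalized IoU of two boxes in (x1, y1, x2, y2) corner format."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = max(0.0, ax2 - ax1) * max(0.0, ay2 - ay1)
    area_b = max(0.0, bx2 - bx1) * max(0.0, by2 - by1)

    # Intersection rectangle (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    iou = inter / union if union > 0 else 0.0

    # Smallest axis-aligned box enclosing both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    if enclose <= 0:
        return iou
    return iou - (enclose - union) / enclose


def l1_box_loss(pred, target):
    """Mean absolute difference over the four box coordinates (assumed form)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / 4.0


def total_loss(cls_loss, pred, target, lambda_1=1.0, lambda_2=1.0):
    """L = L_cls + lambda_1 * L_1 + lambda_2 * L_GIoU, with placeholder weights."""
    giou_loss = 1.0 - giou(pred, target)
    return cls_loss + lambda_1 * l1_box_loss(pred, target) + lambda_2 * giou_loss


if __name__ == "__main__":
    pred = (10.0, 10.0, 50.0, 40.0)
    target = (12.0, 8.0, 48.0, 42.0)
    print(total_loss(cls_loss=0.3, pred=pred, target=target))
```

When the predicted and target boxes coincide, both regression terms vanish and the total reduces to the classification term, which is a convenient sanity check for any reimplementation.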

Abstract

UAV tracking faces significant challenges in real-world scenarios, such as small-size targets and occlusions, which limit the performance of RGB-based trackers. Multispectral images (MSI), which capture additional spectral information, offer a promising solution to these challenges. However, progress in this field has been hindered by the lack of relevant datasets. To address this gap, we introduce the first large-scale Multispectral UAV Single Object Tracking dataset (MUST), which includes 250 video sequences spanning diverse environments and challenges, providing a comprehensive data foundation for multispectral UAV tracking. We also propose a novel tracking framework, UNTrack, which encodes unified spectral, spatial, and temporal features from spectrum prompts, initial templates, and sequential searches. UNTrack employs an asymmetric transformer with a spectral background elimination mechanism for optimal relationship modeling and an encoder that continuously updates the spectrum prompt to refine tracking, improving both accuracy and efficiency. Extensive experiments show that our proposed UNTrack outperforms state-of-the-art UAV trackers. We believe our dataset and framework will drive future research in this area. The dataset is available at https://github.com/q2479036243/MUST-Multispectral-UAV-Single-Object-Tracking.

Paper Structure

This paper contains 24 sections, 8 equations, 14 figures, 11 tables.

Figures (14)

  • Figure 1: In challenging scenarios, the spatial features (e.g., color and texture) of the tracked target closely resemble those of the background, making differentiation and localization difficult. However, the target’s spectral information differs significantly from the background and aligns with the template’s spectral data, providing robust features for reliable tracking.
  • Figure 2: Examples of representative challenge scenarios, with targets marked by red boxes and displayed in magnified views.
  • Figure 3: Statistics of the MUST dataset: (a) Distribution of video sequences across 12 key challenges, (b) Target scale distribution and consistency across different subsets.
  • Figure 4: The proposed UNTrack consists of three components: the unified asymmetric transformer, the spectrum prompt encoder, and the prediction head. UNTrack takes the spectrum prompt, initial template, and sequential search as unified inputs and outputs the target bounding box for each frame. The encoded prompt tokens of the current frame update the spectrum prompt for subsequent tracking.
  • Figure 5: Information streams in the attention mechanism. The interaction between prompt, template, and search corresponds to the nine blocks in the attention map.
  • ...and 9 more figures
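
The Figure 5 caption describes the attention map as nine blocks formed by the prompt, template, and search token groups, and the TL;DR states that the asymmetric attention prunes nonessential interactions. The sketch below shows one way such block-level pruning could be expressed as a boolean token mask. It is an illustration only: the function name `build_block_mask`, the token counts, and in particular the choice of which blocks are pruned are hypothetical, since the summary does not say which interactions UNTrack removes.

```python
def build_block_mask(group_sizes, allowed_pairs):
    """Expand a block-level allow list into a token-level boolean mask.

    group_sizes: list of (group_name, token_count) in sequence order.
    allowed_pairs: set of (query_group, key_group) pairs that may interact.
    Returns mask[q][k] == True when query token q may attend to key token k.
    """
    # Label every token position with the group it belongs to.
    labels = []
    for name, count in group_sizes:
        labels.extend([name] * count)
    return [[(q, k) in allowed_pairs for k in labels] for q in labels]


if __name__ == "__main__":
    # Hypothetical token counts, chosen small so the mask is easy to print.
    sizes = [("prompt", 1), ("template", 2), ("search", 3)]
    groups = [name for name, _ in sizes]

    # Start from all nine blocks, then prune one block as an example.
    # Which blocks are actually pruned in UNTrack is NOT given in this
    # summary; dropping template->search here is purely illustrative.
    allowed = {(q, k) for q in groups for k in groups}
    allowed.discard(("template", "search"))

    mask = build_block_mask(sizes, allowed)
    for row in mask:
        print("".join("1" if cell else "0" for cell in row))
```

In a transformer layer, a mask like this would typically be applied to the attention logits before the softmax so that pruned blocks contribute nothing; skipping those blocks entirely is what would yield the efficiency gain.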