Layer-Guided UAV Tracking: Enhancing Efficiency and Occlusion Robustness

Yang Zhou, Derui Ding, Ran Sun, Ying Sun, Haohua Zhang

TL;DR

LGTrack is a unified UAV tracking framework that integrates dynamic layer selection, efficient feature enhancement, and occlusion-robust representation learning to balance tracking precision against inference efficiency.

Abstract

Visual object tracking (VOT) plays a pivotal role in unmanned aerial vehicle (UAV) applications. Addressing the trade-off between accuracy and efficiency, especially under challenging conditions like unpredictable occlusion, remains a significant challenge. This paper introduces LGTrack, a unified UAV tracking framework that integrates dynamic layer selection, efficient feature enhancement, and robust representation learning for occlusions. By employing a novel lightweight Global-Grouped Coordinate Attention (GGCA) module, LGTrack captures long-range dependencies and global contexts, enhancing feature discriminability with minimal computational overhead. Additionally, a lightweight Similarity-Guided Layer Adaptation (SGLA) module replaces knowledge distillation, achieving an optimal balance between tracking precision and inference efficiency. Experiments on three datasets demonstrate LGTrack's state-of-the-art real-time speed (258.7 FPS on UAVDT) while maintaining competitive tracking accuracy (82.8% precision). Code is available at https://github.com/XiaoMoc/LGTrack

Paper Structure (18 sections, 12 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Comparison with SOTA UAV trackers on UAVDT. Our LGTrack-DeiT achieves a record 82.8% precision at a speed of 258.7 FPS; note that ORTrack-DeiT relies on knowledge distillation.
  • Figure 2: Overview of the proposed LGTrack framework based on ViT backbones with $L$ Transformer blocks. It involves two co-designed core modules (i.e., SGLA and GGCA) for feature refinement and a robust learning strategy (i.e., ORR learning) for training. During training, the selection module (i.e., the 3-layer MLP) is determined using the proposed layer-wise similarity loss function. During inference, GGCA captures long-range dependencies and global contexts, while the 3-layer MLP module in SGLA activates the optimal layer among the last $L-l^\ast$ Transformer blocks and disables redundant layers to avoid wasting computing resources.
  • Figure 3: Overall architecture of GGCA, including the detailed internal structure and feature-processing pipeline, where $G$ is the number of groups, $\oplus$ is element-wise addition, and $\otimes$ is the dot product.
  • Figure 4: Ablation experiments of various masking methods on UAVDT, where the precision (Prec.) and success rate (Succ.) are used for evaluation. The configuration with Spatial Cox Processes achieves the best results.
  • Figure 5: The impact of different numbers of groups ($G$=1, 2, 4, 8) and pooling methods (Avg, Max, Avg+Max) on UAV123. The precision (Prec.) and success rate (Succ.) are used for evaluation. The configuration with $G$=4 and Avg+Max pooling achieves the best results.
  • ...and 1 more figures
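
The inference-time behavior described in the Figure 2 caption (a 3-layer MLP activates one of the last $L-l^\ast$ Transformer blocks and skips the others) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the toy residual blocks, the mean pooling fed to the selector, the random weights, and every name and size below (`run_block`, `select_layer`, `L_STAR`, `DIM`, and so on) are assumptions made purely for illustration. In particular, reading "activates the optimal layer" as "run exactly one of the remaining blocks" is an interpretation of the caption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: L blocks in total, the first L_STAR always run (l* in the caption).
L, L_STAR, DIM, TOKENS = 6, 3, 16, 8

# Toy stand-ins for Transformer blocks (assumption): residual tanh-linear maps.
block_weights = [rng.normal(scale=0.1, size=(DIM, DIM)) for _ in range(L)]


def run_block(x, w):
    """Apply one toy residual block to a (tokens, dim) feature matrix."""
    return x + np.tanh(x @ w)


# Hypothetical 3-layer MLP selector: pooled features -> one score per candidate block.
N_CANDIDATES = L - L_STAR
w1 = rng.normal(scale=0.1, size=(DIM, 32))
w2 = rng.normal(scale=0.1, size=(32, 32))
w3 = rng.normal(scale=0.1, size=(32, N_CANDIDATES))


def select_layer(x):
    """Return the index (0 .. N_CANDIDATES - 1) of the highest-scoring candidate block."""
    pooled = x.mean(axis=0)           # average over tokens (assumed pooling)
    h = np.maximum(pooled @ w1, 0.0)  # layer 1 + ReLU
    h = np.maximum(h @ w2, 0.0)       # layer 2 + ReLU
    scores = h @ w3                   # layer 3: one score per candidate
    return int(np.argmax(scores))


def forward(x):
    """Run the first L_STAR blocks, then only the single selected later block."""
    executed = []
    for i in range(L_STAR):
        x = run_block(x, block_weights[i])
        executed.append(i)
    k = L_STAR + select_layer(x)      # map candidate index to absolute block index
    x = run_block(x, block_weights[k])
    executed.append(k)
    return x, executed


tokens = rng.normal(size=(TOKENS, DIM))
features, executed = forward(tokens)
print("executed blocks:", executed)
```

In this sketch only `L_STAR + 1` of the `L` blocks are evaluated per frame, which is where the computational saving would come from; how the selector is actually trained (the layer-wise similarity loss mentioned in the caption) is not modeled here.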