Similarity-Guided Layer-Adaptive Vision Transformer for UAV Tracking

Chaocan Xue, Bineng Zhong, Qihua Liang, Yaozong Zheng, Ning Li, Yuanliang Xue, Shuxiang Song

TL;DR

This work addresses the efficiency gap of ViT-based UAV trackers by revealing layer redundancy in lightweight ViTs and exploiting a saturation phenomenon in deep layers. It introduces a similarity-guided layer adaptation framework with a small selection module and a layer-wise similarity loss, enabling dynamic pruning of redundant layers while retaining a representative layer to preserve accuracy. The resulting SGLATrack family achieves state-of-the-art real-time performance (e.g., around 225 FPS on GPU) across six UAV benchmarks with competitive AUC/precision, and demonstrates practical viability on embedded hardware (e.g., 33 FPS on Jetson TX2). This approach provides a scalable, architecture-agnostic path to deploy accurate ViT-based UAV trackers on resource-constrained platforms, advancing real-time aerial tracking capabilities.

Abstract

Vision transformers (ViTs) have emerged as a popular backbone for visual tracking. However, complete ViT architectures are too cumbersome to deploy for unmanned aerial vehicle (UAV) tracking, which places a premium on efficiency. In this study, we discover that many layers within lightweight ViT-based trackers tend to learn relatively redundant and repetitive target representations. Based on this observation, we propose a similarity-guided layer adaptation approach to optimize the structure of ViTs. Our approach dynamically disables a large number of representation-similar layers and selectively retains only a single optimal layer among them, aiming to achieve a better accuracy-speed trade-off. By incorporating this approach into existing ViTs, we tailor previously complete ViT architectures into an efficient similarity-guided layer-adaptive framework, namely SGLATrack, for real-time UAV tracking. Extensive experiments on six tracking benchmarks verify the effectiveness of the proposed approach and show that SGLATrack achieves state-of-the-art real-time speed while maintaining competitive tracking precision. Code and models are available at https://github.com/GXNU-ZhongLab/SGLATrack.
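
As a rough illustration of the layer-adaptive idea described in the abstract, the sketch below runs a stack of toy layers but executes only one layer from a group of candidate layers and skips the rest. It is a minimal sketch under stated assumptions, not the SGLATrack implementation: `toy_layer`, `make_selector`, `adaptive_forward`, the 12-layer depth, the choice of layers 6 to 11 as candidates, and the linear scoring rule are all hypothetical, and the training of the selection module is not modeled.

```python
import math
import random


def toy_layer(scale):
    """Residual toy layer standing in for a ViT block (not a real block)."""
    def forward(x):
        return [v + scale * math.tanh(v) for v in x]
    return forward


def make_selector(candidate_ids, dim, rng):
    """Hypothetical selection module: one linear score per candidate layer."""
    weights = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in candidate_ids]

    def select(x):
        scores = [sum(w * v for w, v in zip(row, x)) for row in weights]
        best = max(range(len(scores)), key=scores.__getitem__)
        return candidate_ids[best]

    return select


def adaptive_forward(x, layers, candidate_ids, select):
    """Run layers in order, executing only one of the candidate layers.

    Returns the output features and the indices of the executed layers.
    """
    candidates = set(candidate_ids)
    kept = None
    executed = []
    for i, layer in enumerate(layers):
        if i in candidates:
            if kept is None:
                # Decide once, from the features entering the candidate group.
                kept = select(x)
            if i != kept:
                continue  # disabled layer: skipped entirely
        x = layer(x)
        executed.append(i)
    return x, executed


rng = random.Random(0)
dim, depth = 8, 12                      # assumed sizes for illustration
layers = [toy_layer(0.1) for _ in range(depth)]
candidate_ids = list(range(6, depth))   # assumed group of redundant layers
select = make_selector(candidate_ids, dim, rng)
features = [rng.gauss(0.0, 1.0) for _ in range(dim)]

out, executed = adaptive_forward(features, layers, candidate_ids, select)
print("executed layers:", executed)
```

In this toy setup, 7 of the 12 layers run: the six layers outside the candidate group plus the single retained candidate. Skipping disabled layers outright, rather than masking their outputs, is what would save computation at inference time.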

Paper Structure

This paper contains 16 sections, 8 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Comparison of SGLATrack against other UAV trackers on UAV123. Our SGLATrack demonstrates state-of-the-art performance with an AUC score of 66.9%, while running efficiently at nearly 225 FPS on GPU and approximately 75 FPS on CPU.
  • Figure 2: Layer-by-layer feature changes and AUC changes. Higher cosine similarity denotes fewer feature changes.
  • Figure 3: Overall architecture of the proposed SGLATrack. It is composed of a one-stream backbone, a typical prediction head, and a selection module. During training, the selection module is optimized by the proposed layer-wise similarity loss. During inference, the selection module disables redundant ViT layers and selectively retains an optimal layer among them to alleviate performance drop.
  • Figure 4: Qualitative comparisons of our tracker against three other SOTA trackers. Best viewed in color and by zooming in.
  • Figure 5: Comparison of attention maps. Note that w/o and w/ denote the tracker without and with selection module, respectively.
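
The Figure 2 caption reads cosine similarity as a measure of how little the features change from layer to layer. A minimal sketch of that measurement follows, assuming the similarity is taken between the flattened features of adjacent layers. The helper names and the toy features are hypothetical; the features are built so that updates shrink with depth, and they are not data from the paper.

```python
import math
import random


def cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def layerwise_similarity(features):
    """Similarity between each layer's features and the next layer's."""
    return [cosine_similarity(features[i], features[i + 1])
            for i in range(len(features) - 1)]


# Toy features: each layer adds a smaller update than the previous one,
# so later layers change the representation less (by construction).
random.seed(0)
dim = 128
features = [[random.gauss(0.0, 1.0) for _ in range(dim)]]
for scale in (1.0, 0.5, 0.2, 0.05, 0.01):
    prev = features[-1]
    features.append([v + scale * random.gauss(0.0, 1.0) for v in prev])

sims = layerwise_similarity(features)
for i, s in enumerate(sims):
    print(f"layer {i} -> {i + 1}: cosine similarity = {s:.3f}")
```

A similarity close to 1 between adjacent layers indicates that the later layer changes the representation very little, which is the kind of redundancy the selection module is meant to exploit.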