Target-aware Bidirectional Fusion Transformer for Aerial Object Tracking

Xinglong Sun, Haijiang Sun, Shan Jiang, Jiacheng Wang, Jiasong Wang

TL;DR

A novel target-aware bidirectional fusion transformer for UAV tracking, built on linear separable attention, that combines shallow and deep features in both the forward and backward directions, providing adjusted local cues for localization and global semantics for identification, respectively.

Abstract

Trackers based on lightweight neural networks have achieved great success in aerial remote sensing, and most of them aggregate multi-stage deep features to improve tracking quality. However, existing algorithms usually generate only single-stage fusion features for state decision, ignoring that different kinds of features are required for identifying and locating the object, which limits the robustness and precision of tracking. In this paper, we propose a novel target-aware Bidirectional Fusion Transformer (BFTrans) for UAV tracking. Specifically, we first present a two-stream fusion network based on linear self- and cross-attention, which combines shallow and deep features in both the forward and backward directions, providing adjusted local details for localization and global semantics for recognition. In addition, a target-aware positional encoding strategy is designed for this fusion model, which helps it perceive object-related attributes during the fusion phase. Finally, the proposed method is evaluated on several popular UAV benchmarks, including UAV123, UAV20L and UAVTrack112. Extensive experimental results demonstrate that our approach can outperform other state-of-the-art trackers and run at an average speed of 30.5 FPS on an embedded platform, making it suitable for practical drone deployments.

Paper Structure

This paper contains 18 sections, 8 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Qualitative comparisons of our tracker with four state-of-the-art algorithms on several difficult sequences.
  • Figure 2: Overview of the proposed approach, which consists of a backbone, a target-aware bidirectional fusion transformer and a state prediction module. The multi-stage correlation features are aggregated by both the forward and backward streams of the fusion transformer, generating more appropriate correlation maps for classification and regression, respectively.
  • Figure 3: Structure of our target-aware bidirectional fusion transformer, in which both self- and cross-attention blocks are used to analyze features.
  • Figure 4: Framework of the target-aware positional encoding module, which mainly contains the channel and spatial encoding blocks.
  • Figure 5: Success scores for diverse attributes on the UAV123 dataset, where the values in parentheses depict the minimum and maximum scores of all trackers on each attribute.
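
The two-stream idea described in the abstract and in Figures 2 and 3 can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the `elu(x) + 1` feature map is a commonly used kernel for linear-complexity attention and is assumed here, the function names (`feature_map`, `linear_attention`, `fusion_stream`, `bidirectional_fusion`) are hypothetical, and learned projections, normalization layers and the target-aware positional encoding are all omitted. It only shows how self- and cross-attention with linear cost could pass information from shallow to deep features and back, yielding one fused output per direction.

```python
import numpy as np

def feature_map(x):
    # Positive kernel feature map elu(x) + 1 (an assumed choice, not taken from the paper).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(q, k, v, eps=1e-6):
    """Linear-complexity attention: q (Nq, d), k (Nk, d), v (Nk, dv) -> (Nq, dv)."""
    q, k = feature_map(q), feature_map(k)
    kv = k.T @ v                 # (d, dv): keys and values summarized once
    norm = q @ k.sum(axis=0)     # (Nq,): per-query normalizer
    return (q @ kv) / (norm[:, None] + eps)

def fusion_stream(source, target):
    """Hypothetical one-direction stream: self-attention on source, then target queries it."""
    source = source + linear_attention(source, source, source)   # self-attention
    return target + linear_attention(target, source, source)     # cross-attention

def bidirectional_fusion(shallow, deep):
    forward = fusion_stream(shallow, deep)    # information flows shallow -> deep
    backward = fusion_stream(deep, shallow)   # information flows deep -> shallow
    return forward, backward

# Toy token sets standing in for flattened shallow and deep feature maps.
rng = np.random.default_rng(0)
shallow = rng.standard_normal((64, 32))
deep = rng.standard_normal((16, 32))
forward, backward = bidirectional_fusion(shallow, deep)
print(forward.shape, backward.shape)  # (16, 32) (64, 32)
```

Because the key-value summary `kv` is computed once per stream, the cost grows linearly with the number of tokens rather than quadratically, which is the general motivation for linear attention on embedded hardware.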