Architecture and evaluation protocol for transformer-based visual object tracking in UAV applications

Augustin Borne, Pierre Notin, Christophe Hennequin, Sebastien Changey, Stephane Bazeille, Christophe Cudel, Franz Quint

TL;DR

A Modular Asynchronous Tracking Architecture (MATA) is proposed that combines a transformer-based tracker with an Extended Kalman Filter, integrating ego-motion compensation from sparse optical flow and an object trajectory model. It is accompanied by an embedded-oriented evaluation protocol and a new metric, Normalized Time to Failure (NT2F), which quantifies how long a tracker can sustain a tracking sequence without external help.

Abstract

Object tracking from Unmanned Aerial Vehicles (UAVs) is challenged by platform dynamics, camera motion, and limited onboard resources. Existing visual trackers either lack robustness in complex scenarios or are too computationally demanding for real-time embedded use. We propose a Modular Asynchronous Tracking Architecture (MATA) that combines a transformer-based tracker with an Extended Kalman Filter, integrating ego-motion compensation from sparse optical flow and an object trajectory model. We further introduce a hardware-independent, embedded-oriented evaluation protocol and a new metric called Normalized Time to Failure (NT2F) to quantify how long a tracker can sustain a tracking sequence without external help. Experiments on UAV benchmarks, including an augmented UAV123 dataset with synthetic occlusions, show consistent improvements in the Success and NT2F metrics across multiple tracker processing frequencies. A ROS 2 implementation on an NVIDIA Jetson AGX Orin confirms that the evaluation protocol more closely matches real-time performance on embedded systems.
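The multi-rate idea in the abstract (a slow tracker feeding a faster estimation filter) can be illustrated with a minimal sketch. This is not the paper's implementation: it swaps the Extended Kalman Filter for a linear, one-dimensional constant-velocity Kalman filter, and the class name `ConstantVelocityFilter`, the noise values, the `ego_shift` argument, and the noise-free simulated measurements are all assumptions made for illustration. The rates (filter at 30 Hz, tracker at 10 Hz) follow the example given in the caption of Figure 1.

```python
from dataclasses import dataclass


@dataclass
class ConstantVelocityFilter:
    """1-D constant-velocity Kalman filter over (position, velocity).

    Illustrative stand-in for the EKF named in the abstract; all
    default values below are assumptions, not values from the paper.
    """
    pos: float = 0.0
    vel: float = 0.0
    # Entries of the symmetric covariance [[p00, p01], [p01, p11]].
    p00: float = 100.0
    p01: float = 0.0
    p11: float = 100.0
    process_noise: float = 1.0      # assumed white-acceleration intensity
    measurement_noise: float = 4.0  # assumed tracker position variance (px^2)

    def predict(self, dt: float, ego_shift: float = 0.0) -> None:
        # ego_shift is a hypothetical hook for the image displacement
        # caused by camera motion (e.g. estimated from optical flow).
        self.pos += self.vel * dt + ego_shift
        q = self.process_noise
        # P = F P F^T + Q with F = [[1, dt], [0, 1]].
        p00 = self.p00 + 2 * dt * self.p01 + dt * dt * self.p11 + q * dt**3 / 3
        p01 = self.p01 + dt * self.p11 + q * dt**2 / 2
        p11 = self.p11 + q * dt
        self.p00, self.p01, self.p11 = p00, p01, p11

    def update(self, measured_pos: float) -> None:
        # Position-only measurement, H = [1, 0].
        s = self.p00 + self.measurement_noise
        k0, k1 = self.p00 / s, self.p01 / s
        residual = measured_pos - self.pos
        self.pos += k0 * residual
        self.vel += k1 * residual
        # P = (I - K H) P.
        p00 = (1 - k0) * self.p00
        p01 = (1 - k0) * self.p01
        p11 = self.p11 - k1 * self.p01
        self.p00, self.p01, self.p11 = p00, p01, p11


def run(n_frames=90, frame_rate=30.0, tracker_every=3, true_vel=60.0):
    """Predict on every frame (30 Hz); update on every third frame (10 Hz)."""
    kf = ConstantVelocityFilter()
    dt = 1.0 / frame_rate
    estimates = []
    for i in range(n_frames):
        if i > 0:
            kf.predict(dt)
        if i % tracker_every == 0:
            # Simulated, noise-free tracker output for a target moving
            # at true_vel pixels per second.
            kf.update(true_vel * i * dt)
        estimates.append(kf.pos)
    return estimates


if __name__ == "__main__":
    est = run()
    print(f"{len(est)} estimates, last = {est[-1]:.2f} px")
```

The filter returns an estimate for every frame even though a measurement arrives only on every third one, which is the property that lets a slow tracker still provide a full-rate output in this simplified setting.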

Paper Structure (14 sections, 5 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Overview of the Modular Asynchronous Tracking Architecture (MATA). Example with the tracker running at 10 Hz, the estimation filter at 30 Hz, and ego-motion compensation at 30 Hz.
  • Figure 2: Comparison of evaluation protocols: the long-term protocol (LTP) commonly used in the literature; the down-sampling protocol (DSP), which ignores processing delays; and the proposed embedded-oriented protocol (EOP), which processes frames asynchronously and accounts for processing delays. Example of a VOT tracker running at 7.5–10 Hz, effectively processing every fourth frame. Differences in prediction rate can be observed between a classical VOT tracker (see EOP timeline) and the proposed MATA architecture.
  • Figure 3: Example of a synthetic occlusion shape applied to sequences from the UAV123 dataset.
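
The difference between ignoring and accounting for processing delays, as described in the caption of Figure 2, can be made concrete with a small scheduling sketch. It is one plausible reading of the caption, not the paper's definition of either protocol: the function names `downsampled_frames` and `delayed_frames`, the rule that a busy tracker picks up the first frame captured at or after it finishes, and the 30 fps frame rate are all assumptions.

```python
import math
from fractions import Fraction


def downsampled_frames(n_frames, step):
    """Delay-free down-sampling: every `step`-th frame is processed and the
    result is attributed to that same frame (processing time is ignored)."""
    return [(i, i) for i in range(0, n_frames, step)]


def delayed_frames(n_frames, fps, latency_ms):
    """Delay-aware schedule (assumed rule, for illustration only).

    Returns (processed_frame, ready_frame) pairs, where ready_frame is the
    index of the first frame captured at or after the tracker finishes.
    The tracker then starts again on that frame. Fractions keep the timing
    arithmetic exact.
    """
    latency = Fraction(latency_ms, 1000)
    schedule = []
    frame = 0
    while frame < n_frames:
        done = Fraction(frame, fps) + latency
        ready = math.ceil(done * fps)
        schedule.append((frame, ready))
        # Always advance by at least one frame, even for tiny latencies.
        frame = max(ready, frame + 1)
    return schedule


if __name__ == "__main__":
    print(downsampled_frames(16, 4))
    print(delayed_frames(16, 30, 110))
```

Under these assumptions, a tracker with 110 ms latency on 30 fps video (about 9 Hz) processes frames 0, 4, 8, and 12, and each result only becomes available four frames later. With exactly 100 ms latency it processes every third frame instead. The down-sampled schedule visits the same kind of frame subset but reports each result with no delay, which is the optimistic case the caption attributes to DSP.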