D-VAT: End-to-End Visual Active Tracking for Micro Aerial Vehicles

Alberto Dionigi; Simone Felicioni; Mirko Leomanni; Gabriele Costante

TL;DR

D-VAT addresses visual active tracking (VAT) for micro aerial vehicles (MAVs) with an end-to-end deep reinforcement learning framework that maps monocular RGB input directly to thrust and body-rate commands. It uses an asymmetric actor-critic setup trained with soft actor-critic (SAC): the actor network (A-DNN) processes image sequences, while the critic network (C-DNN) receives privileged state information that guides training. This enables continuous control without restrictive assumptions on target or tracker motion. In the reported experiments, D-VAT outperforms state-of-the-art baselines in photorealistic simulations and transfers zero-shot, without fine-tuning, to a real quadrotor through a Mixed-Reality framework. The approach provides collision-free VAT in cluttered and diverse environments from monocular vision alone, reducing reliance on modular perception-control pipelines.
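
The asymmetric actor-critic split described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy linear maps stand in for the A-DNN and C-DNN, the policy is deterministic rather than SAC's stochastic policy, and all names and sizes (`Sample`, `actor`, `critic`, 3 frames of 12 pixels, a 9-dimensional privileged vector) are assumptions chosen for illustration. The point is only the input routing: the actor sees images, the critic sees privileged state plus the action.

```python
import math
import random
from dataclasses import dataclass
from typing import List

ACTION_DIM = 4  # thrust + three body rates, as stated in the summary


@dataclass
class Sample:
    frames: List[List[float]]  # stack of flattened RGB frames: the actor's only input
    privileged: List[float]    # relative position, velocity, acceleration: critic only


def linear(weights: List[List[float]], x: List[float]) -> List[float]:
    """Plain matrix-vector product, one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def actor(frames: List[List[float]], weights: List[List[float]]) -> List[float]:
    """Toy stand-in for the A-DNN: images in, tanh-squashed action out."""
    flat = [pixel for frame in frames for pixel in frame]
    return [math.tanh(a) for a in linear(weights, flat)]


def critic(privileged: List[float], action: List[float],
           weights: List[List[float]]) -> float:
    """Toy stand-in for the C-DNN: privileged state plus action in, scalar Q out."""
    return linear(weights, privileged + action)[0]


def random_weights(rows: int, cols: int, rng: random.Random) -> List[List[float]]:
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_frames, pixels = 3, 12  # hypothetical sizes, kept tiny for readability
sample = Sample(
    frames=[[rng.random() for _ in range(pixels)] for _ in range(n_frames)],
    privileged=[rng.uniform(-1, 1) for _ in range(9)],  # assumed 3 vectors x 3 components
)
actor_w = random_weights(ACTION_DIM, n_frames * pixels, rng)
critic_w = random_weights(1, 9 + ACTION_DIM, rng)

action = actor(sample.frames, actor_w)                  # the only call needed at deployment
q_value = critic(sample.privileged, action, critic_w)   # used during training only
```

Because the actor never reads `privileged`, the trained policy can run on a platform that provides nothing but camera images.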

Abstract

Visual active tracking is a growing research topic in robotics due to its key role in applications such as human assistance, disaster recovery, and surveillance. In contrast to passive tracking, active tracking approaches combine vision and control capabilities to detect and actively track the target. Most of the work in this area focuses on ground robots, while the very few contributions on aerial platforms still pose important design constraints that limit their applicability. To overcome these limitations, in this paper we propose D-VAT, a novel end-to-end visual active tracking methodology based on deep reinforcement learning that is tailored to micro aerial vehicle platforms. The D-VAT agent computes the vehicle thrust and angular velocity commands needed to track the target by directly processing monocular camera measurements. We show that the proposed approach allows for precise and collision-free tracking operations, outperforming different state-of-the-art baselines on simulated environments which differ significantly from those encountered during training. Moreover, we demonstrate a smooth real-world transition to a quadrotor platform with mixed-reality.

Paper Structure (16 sections, 12 equations, 5 figures, 4 tables)

Introduction
Related Work
Contribution
Preliminary Definitions
Approach
Problem Formulation
Deep Reinforcement Learning Strategy
Optimization
Training Environment
Experiments
Experimental Setup
Baselines
Metrics
Comparison Results
DRL Controller Validation with Mixed-Reality
...and 1 more sections

Figures (5)

Figure 1: Overview of the VAT task. The tracker MAV (blue) adjusts its position and orientation so as to keep the target MAV (red) at the center of the camera FoV and at a predefined distance. Our approach exploits an end-to-end DRL-based VAT method that directly maps RGB images into thrust and angular velocity commands that are fed to the tracker.
Figure 2: Overview of the proposed D-VAT architecture. The A-DNN (highlighted in blue) processes a batch of collected RGB images and computes the body-rate and thrust commands fed to the tracker MAV. The state of the tracker is updated according to the dynamic model \ref{sysmodel} and the resulting pose is employed by the graphics engine to render the next image. The C-DNN (colored in light green) is instead provided with privileged information (relative position, velocity and acceleration) to facilitate the estimation of the action value function during training.
Figure 3: Examples of the training environment randomization. The tracker (blue) and the target (red) MAVs are spawned in a large room with randomized characteristics.
Figure 4: Images from the photo-realistic environments employed to test the generalization capabilities of D-VAT. From left to right: an urban setting (Urban), a park environment (Park), and an office space (Office).
Figure 5: Mixed-Reality framework: the simulation engine renders the image and collects the RGB observation $I_t$ of the tracker MAV. D-VAT then predicts the control signal $u_t$ to follow the target drone. $u_t$ is directly employed to command the real drone. The new position of the real drone $p_{t+1}$ is used to update the position of the simulated tracker drone and collect the next RGB observation $I_{t+1}$.
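
The loop in the Figure 5 caption can be sketched as follows. This is a hedged illustration under stated assumptions, not the authors' framework: `mixed_reality_loop`, `render_stub`, `policy_stub`, and `RealDroneStub` are hypothetical names, the "image" is a fake three-number offset to the target, and the command is treated as a position increment instead of thrust and body rates. Only the data flow matches the caption: render $I_t$, predict $u_t$, command the real drone, read back $p_{t+1}$, render $I_{t+1}$.

```python
from typing import Callable, List, Tuple

Vec3 = Tuple[float, float, float]


def mixed_reality_loop(
    render: Callable[[Vec3], Vec3],
    policy: Callable[[Vec3], Vec3],
    fly_real_drone: Callable[[Vec3], Vec3],
    p0: Vec3,
    steps: int,
) -> List[Vec3]:
    """Run the render -> predict -> fly -> update cycle and return visited positions."""
    p = p0
    trajectory = [p]
    for _ in range(steps):
        image = render(p)      # I_t, rendered from the simulated tracker pose
        u = policy(image)      # u_t, predicted by the controller
        p = fly_real_drone(u)  # real drone executes u_t and reports p_{t+1}
        trajectory.append(p)   # the next iteration renders I_{t+1} from p_{t+1}
    return trajectory


TARGET: Vec3 = (1.0, 0.0, 0.0)  # hypothetical fixed target position


def render_stub(p: Vec3) -> Vec3:
    """Fake 'image': the offset from tracker to target (a real engine returns RGB)."""
    return tuple(t - c for t, c in zip(TARGET, p))


def policy_stub(image: Vec3) -> Vec3:
    """Fake controller: move halfway toward the target each step."""
    return tuple(0.5 * e for e in image)


class RealDroneStub:
    """Fake drone: applies the command as a position increment and reports its position."""

    def __init__(self, p0: Vec3) -> None:
        self.p = p0

    def __call__(self, u: Vec3) -> Vec3:
        self.p = tuple(c + d for c, d in zip(self.p, u))
        return self.p


start: Vec3 = (0.0, 0.0, 0.0)
trajectory = mixed_reality_loop(
    render_stub, policy_stub, RealDroneStub(start), start, steps=5
)
```

The design choice worth noting is that the renderer is driven by the measured position of the real drone, so the policy sees synthetic images while the dynamics are entirely real.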
