A Multimodal Learning-based Approach for Autonomous Landing of UAV
Francisco Neves, Luís Branco, Maria Pereira, Rafael Claro, Andry Pinto
TL;DR
This work targets autonomous UAV landing under challenging conditions by coupling a multimodal transformer-based detector (ViTAL) with a discretized DQN-based lander. ViTAL fuses visual, thermal, and LiDAR inputs to detect a landing marker and output a precise bounding box; it is trained with CIoU and Focal losses, the latter to handle class imbalance. The RL lander operates on a discretized 3D state space to produce robust landing maneuvers, and is trained in simulation and validated outdoors. The detector is robust to modality failures and weather variations and has inference times suitable for edge devices, while the RL lander achieves centimeter-level landing accuracy (around $0.25$ m) under wind disturbances. Overall, the integrated framework combines multimodal perception with learning-based decision-making to deliver precise, real-time autonomous landing on edge hardware, outperforming traditional model-based controllers in unpredictable environments.
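To illustrate how a Focal loss handles imbalance, the following is a minimal sketch of the standard binary focal loss, $FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$, for a single prediction. The function name `focal_loss` and the default values `gamma=2.0` and `alpha=0.25` are assumptions taken from common practice, not values reported for ViTAL, and the sketch does not reproduce the detector's actual training code.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction (illustrative sketch).

    p     -- predicted probability of the positive class, in (0, 1)
    y     -- ground-truth label, 1 (positive) or 0 (negative)
    gamma -- focusing parameter; larger values down-weight easy examples (assumed default)
    alpha -- weight of the positive class (assumed default)
    """
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)            # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p             # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A confidently correct prediction contributes far less than a wrong one.
easy = focal_loss(0.9, 1)
hard = focal_loss(0.1, 1)
```

With `gamma=0` the modulating factor disappears and the expression reduces to an alpha-weighted cross-entropy, which is one way to check the implementation.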
Abstract
In the field of autonomous Unmanned Aerial Vehicle (UAV) landing, conventional approaches fall short in delivering both the required precision and resilience against environmental disturbances. Learning-based algorithms, however, offer promising solutions by leveraging their ability to learn intelligent behaviour from data. On one hand, this paper introduces a novel multimodal transformer-based Deep Learning detector that can provide reliable positioning for precise autonomous landing. It surpasses standard approaches by addressing individual sensor limitations, achieving high reliability even in diverse weather and sensor failure conditions. It was rigorously validated across varying environments, achieving optimal true positive rates and average precisions of up to 90%. On the other hand, the paper proposes a Reinforcement Learning (RL) decision-making model based on a Deep Q-Network (DQN) rationale. Initially trained in simulation, its adaptive behaviour is successfully transferred and validated in a real outdoor scenario. Furthermore, this approach demonstrates rapid inference times of approximately 5 ms, validating its applicability on edge devices.
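As a rough illustration of DQN-style decision-making over a discretized 3D state, the sketch below maps a continuous UAV-to-marker offset to a grid cell and selects an action with an epsilon-greedy rule. Everything specific here is hypothetical: the bin counts, the position ranges, the `ACTIONS` list, and the function names are assumptions for illustration, and the Q-values are passed in as a plain list rather than computed by a trained network.

```python
import random

# Hypothetical discrete action set; the paper's actual actions may differ.
ACTIONS = ["forward", "backward", "left", "right", "descend"]

def discretize(value, low, high, bins):
    """Map a continuous value to a bin index in [0, bins - 1], clamping out-of-range input."""
    if value <= low:
        return 0
    if value >= high:
        return bins - 1
    return int((value - low) / (high - low) * bins)

def state_index(dx, dy, dz, bins=(5, 5, 5), xy_range=(-2.0, 2.0), z_range=(0.0, 10.0)):
    """Convert a relative position in metres to a discrete 3D cell (assumed ranges and bins)."""
    ix = discretize(dx, xy_range[0], xy_range[1], bins[0])
    iy = discretize(dy, xy_range[0], xy_range[1], bins[1])
    iz = discretize(dz, z_range[0], z_range[1], bins[2])
    return (ix, iy, iz)

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the highest-valued action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Example: UAV centred over the marker at 5 m altitude, acting greedily.
state = state_index(0.0, 0.0, 5.0)
action = ACTIONS[epsilon_greedy([0.1, 0.2, 0.0, 0.3, 0.9], epsilon=0.0)]
```

In a full DQN, the list of Q-values would come from a neural network evaluated on the current state, and `epsilon` would typically be decayed during training so that the agent explores early and exploits later.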
