A Multimodal Learning-based Approach for Autonomous Landing of UAV

Francisco Neves; Luís Branco; Maria Pereira; Rafael Claro; Andry Pinto

TL;DR

This work targets autonomous UAV landing under challenging conditions by coupling a multimodal transformer-based detector (ViTAL) with a DQN-based lander. ViTAL fuses visual, thermal, and LiDAR inputs to detect a landing marker and output a precise bounding box; it is trained with CIoU and Focal losses to handle imbalance. The RL lander operates on a discretized 3D state space to produce robust landing maneuvers; it is trained in simulation and validated outdoors. The detector is robust to modality failures and weather variations and reaches inference times suitable for edge devices, while the RL lander achieves centimeter-level landing accuracy (around $0.25$ m) under wind disturbances. Overall, the integrated framework combines multimodal perception with learning-based decision-making to outperform traditional model-based controllers in unpredictable environments, and it is practical for real-time deployment on edge hardware.
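As a rough illustration of the two losses named above, the sketch below implements the standard textbook definitions of CIoU and Focal loss in plain Python. It is not the authors' training code: the function names, the `(x1, y1, x2, y2)` box format, and the `alpha` and `gamma` defaults are assumptions chosen for the example.

```python
import math


def ciou(box_a, box_b, eps=1e-9):
    """Complete IoU of two boxes in (x1, y1, x2, y2) format (assumed layout)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    wa, ha = ax2 - ax1, ay2 - ay1
    wb, hb = bx2 - bx1, by2 - by1

    # Plain IoU: overlap area divided by union area.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = wa * ha + wb * hb - inter
    iou = inter / (union + eps)

    # Squared distance between box centres.
    rho2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
         + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both.
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan(wb / (hb + eps))
                              - math.atan(wa / (ha + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return iou - rho2 / c2 - alpha * v


def ciou_loss(pred_box, target_box):
    """CIoU loss: 0 for a perfect match, larger for worse boxes."""
    return 1.0 - ciou(pred_box, target_box)


def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-9):
    """Binary focal loss for predicted probability p and label y in {0, 1}.

    alpha and gamma are common defaults, not values taken from the paper.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t)^gamma down-weights examples that are already well classified.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t + eps)
```

With these definitions, a confidently correct prediction contributes far less to the Focal loss than a confidently wrong one, which is how the loss keeps abundant easy examples from dominating training. Unlike plain IoU, the CIoU loss also stays informative for non-overlapping boxes through its centre-distance term.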

Abstract

In the field of autonomous Unmanned Aerial Vehicle (UAV) landing, conventional approaches fall short in delivering both the required precision and resilience against environmental disturbances. Learning-based algorithms offer a promising alternative because they can learn intelligent behaviour from data. On the one hand, this paper introduces a novel multimodal transformer-based Deep Learning detector that provides reliable positioning for precise autonomous landing. It surpasses standard approaches by addressing individual sensor limitations, achieving high reliability even under diverse weather and sensor failure conditions. It was rigorously validated across varying environments, achieving optimal true positive rates and average precisions of up to 90%. On the other hand, the paper proposes a Reinforcement Learning (RL) decision-making model based on a Deep Q-Network (DQN) rationale. Initially trained in simulation, its adaptive behaviour is successfully transferred and validated in a real outdoor scenario. Furthermore, this approach demonstrates rapid inference times of approximately 5 ms, validating its applicability on edge devices.
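For readers unfamiliar with the DQN rationale mentioned above, the following minimal sketch shows the generic one-step Q-learning target and greedy action selection that DQN-style agents rely on. It is a generic illustration, not the paper's lander: the function names, the discount factor, and the toy Q-values are assumptions.

```python
def dqn_targets(rewards, next_q_values, dones, gamma=0.99):
    """One-step targets y = r + gamma * max_a' Q(s', a'), with no bootstrap
    on terminal transitions. gamma is an assumed discount factor."""
    targets = []
    for r, q_next, done in zip(rewards, next_q_values, dones):
        bootstrap = 0.0 if done else gamma * max(q_next)
        targets.append(r + bootstrap)
    return targets


def greedy_action(q_values):
    """Index of the action with the highest estimated Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Toy batch: two transitions with two discrete actions each (made-up numbers).
batch_targets = dqn_targets(
    rewards=[1.0, -1.0],
    next_q_values=[[0.5, 2.0], [3.0, 1.0]],
    dones=[False, True],
    gamma=0.9,
)
```

In a full agent, a neural network would produce the Q-values and be regressed toward these targets; at deployment only the greedy step is needed, so inference cost is a single forward pass per decision.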

Paper Structure (21 sections, 7 equations, 7 figures, 8 tables, 1 algorithm)

Introduction
Learning-based Autonomous Lander
Multimodal Landing Target Detector
Architecture
Backbone
Transformer Encoder
MLP heads
Loss function
Reinforcement Learning Decision-Making Lander
Problem Formulation
Results
Multimodal Landing Target Detector
Training
Real Experiments
Modality failure test
...and 6 more sections

Figures (7)

Figure 1: The multimodal landing target detector architecture.
Figure 2: The RL decision-making lander architecture.
Figure 3: Examples of dataset samples with various altitude, positioning and modality activation cases.
Figure 4: The final representations after modality failure. The first column shows two examples of representations where all modalities are activated. The second, third, and fourth columns show examples of a disabled LiDAR, thermal, and visual sensor, respectively. The fifth, sixth, and seventh columns show examples of two disabled sensors: LiDAR and thermal, LiDAR and visual, and thermal and visual, respectively.
Figure 5: The detection results on weather-restricted samples. The first three columns represent, from left to right, lighting variations of +10%, +50%, and +90% of the image intensity. This effect is manifested through the increased intensity of the visual (blue) channel. The fourth, fifth, and sixth columns represent, from left to right, lighting variations of -10%, -50%, and -90% of the image intensity. This effect is manifested through the dimming of the visual (blue) channel. The last three columns represent, from left to right, a stochastic fog effect of $[10\%,50\%]$, $[50\%,90\%]$, and $[90\%,100\%]$.
...and 2 more figures
