AerialVLN: Vision-and-Language Navigation for UAVs

Shubo Liu; Hongsheng Zhang; Yuankai Qi; Peng Wang; Yaning Zhang; Qi Wu

TL;DR

AerialVLN introduces a large-scale, city-level Vision-and-Language Navigation task for UAVs, extending VLN to continuous 4-DOF aerial control and first-person perception in dynamic outdoor environments. The authors provide an Unreal Engine 4 + AirSim simulator, 25 city scenes, over 8k human-generated trajectories with aligned instructions, and a CMA baseline extended with look-ahead guidance to tackle 3D navigation. Experimental results show a substantial gap to human performance; the Look-ahead CMA variant offers meaningful improvements, and the modality ablation confirms that joint vision-language reasoning is necessary. The work establishes a new benchmark and baseline methodology for long-horizon, open-world aerial VLN and points to future improvements in 3D, cross-modal, and obstacle-aware navigation.

Abstract

Recently emerged Vision-and-Language Navigation (VLN) tasks have drawn significant attention in both the computer vision and natural language processing communities. Existing VLN tasks are built for agents that navigate on the ground, either indoors or outdoors. However, many tasks require intelligent agents to operate in the sky, such as UAV-based goods delivery, traffic/security patrol, and scenery tours, to name a few. Navigating in the sky is more complicated than on the ground because agents need to consider the flying height and more complex spatial relationship reasoning. To fill this gap and facilitate research in this field, we propose a new task named AerialVLN, which is UAV-based and targets outdoor environments. We develop a 3D simulator rendered with near-realistic pictures of 25 city-level scenarios. Our simulator supports continuous navigation, environment extension and configuration. We also propose an extended baseline model based on the widely used cross-modal-alignment (CMA) navigation methods. We find that there is still a significant gap between the baseline model and human performance, which suggests AerialVLN is a new challenging task. The dataset and code are available at https://github.com/AirVLN/AirVLN.

Paper Structure (15 sections, 7 figures, 5 tables)

Introduction
Related Work
The AerialVLN Task
Simulator
Dataset
Data Collection
Data Analysis
Experiment and Results
Evaluation Metrics
Results
Baselines
Results
Modality Ablation Study
Conclusion
Acknowledgement

Figures (7)

Figure 2: Statistics of nouns and verbs.
Figure 3: Instruction length and number of actions.
Figure 4: Main architecture of the Cross-Modal Attention model.
Figure 5: Illustration of Look-ahead Guidance. 'A' denotes starting location; '$\star$' denotes destination; 'X' denotes current location; Blue path denotes ground-truth; Yellow path denotes "generated ground-truth" when the agent deviates from the real ground-truth path.
Figure 6: Visualisation of a successful navigation of our LAG model. Green arrows indicate horizontal movement motions (Move Forward, Move Left/Right); blue arrows represent vertical motion (Move Up/Down) and horizontal rotation (Turn Left/Right). The final red circle denotes Stop. We highlight aligned landmarks by coloured bounding boxes in images and words in the instruction using the same colour. The superscript of words denotes the index of the corresponding action in images.
...and 2 more figures
