AutoFly: Vision-Language-Action Model for UAV Autonomous Navigation in the Wild

Xiaolou Sun; Wufei Si; Wenhui Ni; Yuntian Li; Dongming Wu; Fei Xie; Runwei Guan; He-Yang Xu; Henghui Ding; Yuan Wu; Yutao Yue; Yongming Huang; Hui Xiong

TL;DR

AutoFly tackles autonomous UAV navigation in unknown outdoor environments by formulating vision-language navigation (VLN) as a Vision-Language-Action (VLA) problem and introducing a pseudo-depth encoder that derives depth-aware features from RGB inputs to strengthen spatial reasoning. The method uses a two-stage training regime that first aligns vision and language, then fine-tunes spatial actions with depth-informed multimodal fusion, including a Siamese MLP depth-vision-language projector. A large-scale autonomous navigation dataset combining simulated AirSim trajectories and real-world flights supports sim-to-real transfer, with trajectory rebalancing to mitigate long-horizon data bias. Experiments show AutoFly achieves a 3.9% higher navigation success rate than state-of-the-art VLA baselines and maintains consistent performance across simulated and real environments, indicating practical applicability for tasks such as search and rescue, environmental monitoring, and autonomous delivery.
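To make the idea of a Siamese projector concrete, here is a minimal NumPy sketch. It assumes "Siamese" means that one MLP with shared weights projects both the visual features and the pseudo-depth features into the language model's token space. The function names, feature sizes, ReLU activation, and token-axis concatenation are all illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def init_mlp(rng, d_in, d_hidden, d_out):
    """Create weights for a two-layer MLP (all sizes are hypothetical)."""
    return {
        "w1": rng.normal(0.0, 0.02, (d_in, d_hidden)),
        "b1": np.zeros(d_hidden),
        "w2": rng.normal(0.0, 0.02, (d_hidden, d_out)),
        "b2": np.zeros(d_out),
    }

def mlp(params, x):
    """Apply the two-layer MLP; ReLU is an assumed activation."""
    hidden = np.maximum(x @ params["w1"] + params["b1"], 0.0)
    return hidden @ params["w2"] + params["b2"]

def siamese_project(params, visual_feats, depth_feats):
    """Project both streams with the SAME weights, then stack them as tokens."""
    visual_tokens = mlp(params, visual_feats)
    depth_tokens = mlp(params, depth_feats)
    # Concatenating along the token axis is an assumption about the fusion step.
    return np.concatenate([visual_tokens, depth_tokens], axis=0)

rng = np.random.default_rng(0)
params = init_mlp(rng, d_in=64, d_hidden=128, d_out=32)
visual = rng.normal(size=(16, 64))  # 16 visual patch features (hypothetical)
depth = rng.normal(size=(16, 64))   # 16 pseudo-depth patch features (hypothetical)
tokens = siamese_project(params, visual, depth)
print(tokens.shape)  # (32, 32): 16 + 16 tokens, each projected to 32 dimensions
```

Because the weights are shared, identical inputs on the two branches produce identical tokens. A non-Siamese variant would instead call `init_mlp` twice and give each stream its own parameters.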

Abstract

Vision-language navigation (VLN) requires intelligent agents to navigate environments by interpreting linguistic instructions alongside visual observations, serving as a cornerstone task in Embodied AI. Current VLN research for unmanned aerial vehicles (UAVs) relies on detailed, pre-specified instructions to guide the UAV along predetermined routes. However, real-world outdoor exploration typically occurs in unknown environments where detailed navigation instructions are unavailable. Instead, only coarse-grained positional or directional guidance can be provided, requiring UAVs to autonomously navigate through continuous planning and obstacle avoidance. To bridge this gap, we propose AutoFly, an end-to-end Vision-Language-Action (VLA) model for autonomous UAV navigation. AutoFly incorporates a pseudo-depth encoder that derives depth-aware features from RGB inputs to enhance spatial reasoning, coupled with a progressive two-stage training strategy that effectively aligns visual, depth, and linguistic representations with action policies. Moreover, existing VLN datasets have fundamental limitations for real-world autonomous navigation, stemming from their heavy reliance on explicit instruction-following over autonomous decision-making and insufficient real-world data. To address these issues, we construct a novel autonomous navigation dataset that shifts the paradigm from instruction-following to autonomous behavior modeling through: (1) trajectory collection emphasizing continuous obstacle avoidance, autonomous planning, and recognition workflows; (2) comprehensive real-world data integration. Experimental results demonstrate that AutoFly achieves a 3.9% higher success rate compared to state-of-the-art VLA baselines, with consistent performance across simulated and real environments.

Paper Structure (37 sections, 12 equations, 17 figures, 9 tables)

Introduction
Related Work
Method
Task Formulation
VLA Model for Autonomous Navigation
Autonomous Navigation Dataset
Training Paradigm of AutoFly
Experiments
Implementation Details
Simulation Performance
Real-world Performance
Ablation Experiments
Conclusion
Acknowledgments
Appendix
...and 22 more sections

Figures (17)

Figure 1: Analysis of previous methods and our AutoFly. Left: Previous methods [lee2024citynav, liu2023aerialvln] rely on dedicated, step-by-step instructions that specify predetermined flight paths with explicit waypoints and maneuvers. Right: Our AutoFly performs autonomous navigation with concise natural language instructions and coarse positional or directional information.
Figure 2: Framework of AutoFly. AutoFly takes RGB observations and linguistic instructions as inputs and directly outputs high-level actions. These actions, combined with initial actions derived from coarse-grained positional or directional information, form action sequences.
Figure 3: Overview of autonomous navigation dataset statistical analysis.
Figure 4: Comparison of three paradigms for integrating depth information during fine-tuning: (a) Siamese MLP projector, (b) Non-Siamese MLP projector, (c) Direct depth integration.
Figure 5: Visualization of AutoFly in the real indoor environment. The experimental arena is a structured indoor environment designed for autonomous navigation and mapping tasks. We have achieved a 60% success rate in real-environment testing. For more visualizations and details, please refer to the Visualization section of the Appendix.
...and 12 more figures
