AutoFly: Vision-Language-Action Model for UAV Autonomous Navigation in the Wild
Xiaolou Sun, Wufei Si, Wenhui Ni, Yuntian Li, Dongming Wu, Fei Xie, Runwei Guan, He-Yang Xu, Henghui Ding, Yuan Wu, Yutao Yue, Yongming Huang, Hui Xiong
TL;DR
AutoFly tackles autonomous UAV navigation in unknown outdoor environments by formulating VLN as a Vision-Language-Action problem and introducing a pseudo-depth encoder that derives depth-aware features from RGB inputs to strengthen spatial reasoning. Training proceeds in two stages: the first aligns vision and language, and the second fine-tunes action prediction with depth-informed multimodal fusion, including a Siamese MLP depth-vision-language projector. A large-scale autonomous navigation dataset combining simulated AirSim trajectories and real-world flights supports sim-to-real transfer, with trajectory rebalancing to mitigate long-horizon data bias. In experiments, AutoFly achieves a 3.9% higher navigation success rate than state-of-the-art VLA baselines and performs consistently across simulated and real environments, indicating practical applicability for tasks such as search and rescue, environmental monitoring, and autonomous delivery.
Abstract
Vision-language navigation (VLN) requires intelligent agents to navigate environments by interpreting linguistic instructions alongside visual observations, serving as a cornerstone task in Embodied AI. Current VLN research for unmanned aerial vehicles (UAVs) relies on detailed, pre-specified instructions to guide the UAV along predetermined routes. However, real-world outdoor exploration typically occurs in unknown environments where detailed navigation instructions are unavailable. Instead, only coarse-grained positional or directional guidance can be provided, requiring UAVs to autonomously navigate through continuous planning and obstacle avoidance. To bridge this gap, we propose AutoFly, an end-to-end Vision-Language-Action (VLA) model for autonomous UAV navigation. AutoFly incorporates a pseudo-depth encoder that derives depth-aware features from RGB inputs to enhance spatial reasoning, coupled with a progressive two-stage training strategy that effectively aligns visual, depth, and linguistic representations with action policies. Moreover, existing VLN datasets have fundamental limitations for real-world autonomous navigation, stemming from their heavy reliance on explicit instruction-following over autonomous decision-making and insufficient real-world data. To address these issues, we construct a novel autonomous navigation dataset that shifts the paradigm from instruction-following to autonomous behavior modeling through: (1) trajectory collection emphasizing continuous obstacle avoidance, autonomous planning, and recognition workflows; (2) comprehensive real-world data integration. Experimental results demonstrate that AutoFly achieves a 3.9% higher success rate compared to state-of-the-art VLA baselines, with consistent performance across simulated and real environments.
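To make the depth-vision-language fusion idea concrete, the following is a minimal sketch of a Siamese (weight-sharing) MLP projector, not the authors' implementation. It assumes that visual tokens and pseudo-depth tokens have the same feature width, that one shared two-layer MLP maps both into the language embedding width, and that the fused sequence is ordered vision, depth, language. The function names, layer sizes, ReLU activation, and token ordering are all hypothetical choices made for illustration.

```python
import random


def make_mlp(d_in, d_hidden, d_out, seed=0):
    """Create weights for a two-layer MLP (hypothetical sizes, no biases)."""
    rng = random.Random(seed)

    def init(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {"w1": init(d_in, d_hidden), "w2": init(d_hidden, d_out)}


def matvec(x, w):
    """Multiply a row vector x (length rows) by a rows x cols weight matrix."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def mlp_forward(params, token):
    """Linear -> ReLU -> Linear on a single token."""
    hidden = [max(0.0, h) for h in matvec(token, params["w1"])]
    return matvec(hidden, params["w2"])


def siamese_project(params, vision_tokens, depth_tokens):
    """Project both modalities with the SAME weights (the Siamese part)."""
    vis = [mlp_forward(params, t) for t in vision_tokens]
    dep = [mlp_forward(params, t) for t in depth_tokens]
    return vis, dep


def fuse(params, vision_tokens, depth_tokens, language_tokens):
    """Build the multimodal sequence; the ordering here is an assumption."""
    vis, dep = siamese_project(params, vision_tokens, depth_tokens)
    return vis + dep + language_tokens
```

Because both branches share one set of weights, a visual token and a depth token with identical features map to the same point in the language embedding space, which is the property that distinguishes a Siamese projector from two independent projectors.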
