OpenFly: A Comprehensive Platform for Aerial Vision-Language Navigation

Yunpeng Gao, Chenhui Li, Zhongrui You, Junli Liu, Zhen Li, Pengan Chen, Qizhi Chen, Zhonghan Tang, Liansheng Wang, Penghui Yang, Yiwen Tang, Yuhang Tang, Shuai Liang, Songyi Zhu, Ziqin Xiong, Yifei Su, Xinyi Ye, Jianan Li, Yan Ding, Dong Wang, Zhigang Wang, Bin Zhao, Xuelong Li

TL;DR

OpenFly addresses the lack of large-scale outdoor aerial Vision-Language Navigation (VLN) data by combining multiple rendering engines, a highly automated data-generation toolchain, and an aerial VLN benchmark of 100k trajectories across 18 scenes. It introduces OpenFly-Agent, a keyframe-aware VLN model that emphasizes landmark observations through keyframe selection and visual token merging, improving both efficiency and performance. The platform delivers a diverse, scalable dataset that includes real-to-sim scenes, and its experiments report superior performance and strong generalization, including real-world validation. By automating data collection, the work lowers the barrier to developing aerial VLN systems and offers a practical path toward language-guided UAV navigation in realistic environments.

Abstract

Vision-Language Navigation (VLN) aims to guide agents by leveraging language instructions and visual cues, playing a pivotal role in embodied AI. Indoor VLN has been extensively studied, whereas outdoor aerial VLN remains underexplored. A likely reason is that outdoor aerial views encompass vast areas, which makes data collection more challenging and results in a lack of benchmarks. To address this problem, we propose OpenFly, a platform comprising various rendering engines, a versatile toolchain, and a large-scale benchmark for aerial VLN. First, we integrate diverse rendering engines and advanced techniques for environment simulation, including Unreal Engine, GTA V, Google Earth, and 3D Gaussian Splatting (3D GS). In particular, 3D GS supports real-to-sim rendering, further enhancing the realism of our environments. Second, we develop a highly automated toolchain for aerial VLN data collection, streamlining point cloud acquisition, scene semantic segmentation, flight trajectory creation, and instruction generation. Third, based on the toolchain, we construct a large-scale aerial VLN dataset with 100k trajectories, covering diverse heights and lengths across 18 scenes. Moreover, we propose OpenFly-Agent, a keyframe-aware VLN model emphasizing key observations during flight. For benchmarking, we conduct extensive experiments and analyses, evaluating several recent VLN methods and showcasing the superiority of our OpenFly platform and agent. The toolchain, dataset, and code will be open-sourced.

Paper Structure

This paper contains 30 sections, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Overview of OpenFly. This work consists of the integration of 4 rendering engines, an automatic toolchain for data generation, a large-scale aerial VLN dataset comprising 100K trajectories and instructions, and a keyframe-aware VLN model emphasizing key observations.
  • Figure 2: Framework of the automatic data generation. Various rendering engines are first integrated, providing diverse high-quality scenes. Built on these, several interfaces and tools are developed, enabling automated generation of trajectories and instructions.
  • Figure 3: Statistical analysis of the generated data. (a) Length and height distributions of trajectories. (b) Action distributions. (c) Word cloud of nouns. (d) Word cloud of verbs.
  • Figure 4: The architecture of OpenFly-Agent. Keyframes at the time of action transitions are selected to extract crucial observations as the history, with corresponding visual tokens compressed to reduce the computational burden.
  • Figure 5: High-quality examples from different rendering engines and techniques. The scenes include several large cities, such as Shanghai, Guangzhou, Los Angeles, and Osaka, and cover an area of over a hundred square kilometers in total. 3D GS provides five large campus scenes, further enhancing the diversity and realism of the data.
  • ...and 6 more figures
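
The Figure 4 caption describes two ideas: keeping frames at action transitions as history, and compressing their visual tokens. The sketch below illustrates both in plain Python. It is not the authors' implementation. The function names `select_keyframes` and `merge_tokens`, the action labels, and the use of average pooling over fixed-size groups of tokens are all assumptions made for illustration; the source does not specify how tokens are merged.

```python
from typing import List, Sequence


def select_keyframes(actions: Sequence[str]) -> List[int]:
    """Return indices of frames whose action differs from the previous frame.

    Assumption: an "action transition" is any change in the action label
    between consecutive time steps. Frame 0 has no predecessor and is skipped.
    """
    keyframes = []
    for t in range(1, len(actions)):
        if actions[t] != actions[t - 1]:
            keyframes.append(t)
    return keyframes


def merge_tokens(tokens: Sequence[Sequence[float]], group: int) -> List[List[float]]:
    """Average every `group` consecutive tokens into a single token.

    Assumption: plain average pooling stands in for whatever merging rule
    the model actually uses. A trailing partial group is averaged as-is.
    """
    if group < 1:
        raise ValueError("group must be a positive integer")
    merged = []
    for start in range(0, len(tokens), group):
        chunk = tokens[start:start + group]
        dim = len(chunk[0])
        merged.append([sum(tok[d] for tok in chunk) / len(chunk) for d in range(dim)])
    return merged


# Hypothetical action sequence for one short flight.
actions = ["forward", "forward", "turn_left", "forward", "forward", "ascend"]
keyframe_ids = select_keyframes(actions)

# Hypothetical 2-dimensional visual tokens for one keyframe.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
compressed = merge_tokens(tokens, group=2)

print(keyframe_ids)
print(compressed)
```

In this toy run, three of the six frames are kept as history, and each kept frame's token count is halved. That is the kind of reduction in computational burden the caption refers to.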