Towards Realistic UAV Vision-Language Navigation: Platform, Benchmark, and Methodology

Xiangyu Wang, Donglin Yang, Ziqin Wang, Hohin Kwan, Jinyu Chen, Wenjun Wu, Hongsheng Li, Yue Liao, Si Liu

TL;DR

This work introduces OpenUAV, a realistic UAV VLN platform built with UE4 and AirSim that supports continuous 6-DoF trajectories and multi-sensor data. It further provides UAV-Need-Help, a target-oriented VLN benchmark with assistant-guided navigation at multiple levels, and a UAV navigation LLM that yields hierarchical trajectories via multimodal inputs and a backtracking data-augmentation strategy. The combination enables realistic trajectory-based UAV VLN research, demonstrated by a 12k-trajectory dataset and strong performance improvements over baselines, while still leaving a gap to human operators. The work highlights avenues for autonomous UAV navigation and sim-to-real transfer to bridge simulation and real-world deployment.

Abstract

Developing agents capable of navigating to a target location based on language instructions and visual information, known as vision-language navigation (VLN), has attracted widespread interest. Most research has focused on ground-based agents, while UAV-based VLN remains relatively underexplored. Recent efforts in UAV vision-language navigation predominantly adopt ground-based VLN settings, relying on predefined discrete action spaces and neglecting the inherent disparities in agent movement dynamics and the complexity of navigation tasks between ground and aerial environments. To address these disparities and challenges, we propose solutions from three perspectives: platform, benchmark, and methodology. To enable realistic UAV trajectory simulation in VLN tasks, we propose the OpenUAV platform, which features diverse environments, realistic flight control, and extensive algorithmic support. We further construct a target-oriented VLN dataset consisting of approximately 12k trajectories on this platform, serving as the first dataset specifically designed for realistic UAV VLN tasks. To tackle the challenges posed by complex aerial environments, we propose an assistant-guided UAV object search benchmark called UAV-Need-Help, which provides varying levels of guidance information to help UAVs better accomplish realistic VLN tasks. We also propose a UAV navigation LLM that, given multi-view images, task descriptions, and assistant instructions, leverages the multimodal understanding capabilities of a multimodal LLM (MLLM) to jointly process visual and textual information, and performs hierarchical trajectory generation. Our method significantly outperforms the baseline models, yet a considerable gap remains between its results and those achieved by human operators, underscoring the challenge presented by the UAV-Need-Help task.

Paper Structure

This paper contains 25 sections, 1 equation, 9 figures, 5 tables.

Figures (9)

  • Figure 1: We propose a realistic UAV simulation platform and a novel UAV-Need-Help benchmark. The OpenUAV platform focuses on realistic UAV VLN tasks, integrating diverse environmental components, realistic flight simulations, and algorithmic support. The UAV-Need-Help benchmark introduces an assistant-guided UAV object search task, where the UAV navigates to a target object using object descriptions, environmental information, and guidance from assistants.
  • Figure 2: Overview of our dataset construction and statistical analysis. (a) Data collection pipeline for generating high-quality target descriptions and realistic UAV trajectories. (b)-(e) Statistical analysis of the dataset, covering trajectory lengths, task distances, object categories, and dataset splits. In (e), UM and UO represent Unseen Map and Unseen Object, respectively.
  • Figure 3:
  • Figure 4: Visualization of the object search results of our method. The first two rows show our UAV successfully following the instruction. Notably, the third to fifth images depict the drone executing a turning maneuver, which changes the drone's perspective. The third row illustrates a failure case in which the UAV collides with trees in a forest scenario.
  • Figure 5: Performance scalability with varying amounts of training data.
  • ...and 4 more figures