IndoorUAV: Benchmarking Vision-Language UAV Navigation in Continuous Indoor Environments

Xu Liu, Yu Liu, Hanshuo Qiu, Yang Qirong, Zhouhui Lian

TL;DR

IndoorUAV addresses the gap in vision-language navigation for indoor aerial robots by introducing a large-scale benchmark built on Habitat with 1,075 scenes and 50k+ UAV trajectories, plus two task subsets: IndoorUAV-VLN (long-horizon) and IndoorUAV-VLA (short-horizon). It couples automated data generation with GPT-4o-based instruction synthesis and proposes IndoorUAV-Agent, a hierarchical framework that decomposes long instructions into sub-tasks executed by a VLA model. Experimental results reveal substantial performance gaps for existing models on indoor UAV VLN and demonstrate the benefit of task decomposition and multimodal reasoning, especially for long-horizon tasks. The benchmark and baseline provide a foundation for developing grounded language understanding and fine-grained 3D motion control in indoor aerial navigation.

Abstract

Vision-Language Navigation (VLN) enables agents to navigate in complex environments by following natural language instructions grounded in visual observations. Although most existing work has focused on ground-based robots or outdoor Unmanned Aerial Vehicles (UAVs), indoor UAV-based VLN remains underexplored, despite its relevance to real-world applications such as inspection, delivery, and search-and-rescue in confined spaces. To bridge this gap, we introduce IndoorUAV, a novel benchmark and method specifically tailored for VLN with indoor UAVs. We begin by curating over 1,000 diverse and structurally rich 3D indoor scenes from the Habitat simulator. Within these environments, we simulate realistic UAV flight dynamics to collect diverse 3D navigation trajectories manually, further enriched through data augmentation techniques. Furthermore, we design an automated annotation pipeline to generate natural language instructions of varying granularity for each trajectory. This process yields over 16,000 high-quality trajectories, comprising the IndoorUAV-VLN subset, which focuses on long-horizon VLN. To support short-horizon planning, we segment long trajectories into sub-trajectories by selecting semantically salient keyframes and regenerating concise instructions, forming the IndoorUAV-VLA subset. Finally, we introduce IndoorUAV-Agent, a novel navigation model designed for our benchmark, leveraging task decomposition and multimodal reasoning. We hope IndoorUAV serves as a valuable resource to advance research on vision-language embodied AI in the indoor aerial navigation domain.

Paper Structure

This paper contains 31 sections, 5 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Illustration of IndoorUAV-VLN (upper) and IndoorUAV-VLA (lower) datasets. Long-horizon VLN tasks typically involve complex instructions and longer trajectory lengths, while VLA tasks focus on fine-grained maneuver execution, consisting of 1-3 executable actions.
  • Figure 2: Overview of the IndoorUAV data collection and instruction generation pipeline.
  • Figure 3: Statistical analysis of the IndoorUAV benchmark. (a) Action distributions. (b) Trajectory length distributions.
  • Figure 4: For the long-horizon VLN task, we first use GPT-4o to decompose the long instruction into $n$ shorter VLA-style instructions as subtasks, each containing 1 to 3 actions. We then process each subtask sequentially using a VLA model based on the $\pi_0$ architecture.
  • Figure 5: Visualization of results on both IndoorUAV-VLA and IndoorUAV-VLN. The upper two examples are VLA tasks of medium and hard difficulty (2-3 executable actions), respectively. The lower example is a VLN task, where the markers S and T in the trajectory plot indicate the start and target positions of the trajectory.
  • ...and 4 more figures
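
The hierarchical scheme described in the Figure 4 caption (decompose a long instruction into short subtasks, then execute them one after another) can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: `run_hierarchical_episode`, `toy_decompose`, `toy_execute`, and the `(x, y, z, yaw)` pose layout are all hypothetical names chosen for this example. In the paper the decomposer is GPT-4o and the executor is a VLA model; here both are replaced by simple rule-based stand-ins so the control flow can run on its own.

```python
from typing import Callable, List, Tuple

# Hypothetical UAV state: (x, y, z, yaw in degrees). The real state format is not given here.
Pose = Tuple[float, float, float, float]


def run_hierarchical_episode(
    instruction: str,
    start_pose: Pose,
    decompose: Callable[[str], List[str]],
    execute_subtask: Callable[[str, Pose], Pose],
) -> Tuple[Pose, List[Tuple[str, Pose]]]:
    """Split a long instruction into subtasks and run them sequentially.

    `decompose` stands in for the language-model planner and `execute_subtask`
    for the low-level policy; both are injected so they can be swapped out.
    """
    pose = start_pose
    log: List[Tuple[str, Pose]] = []
    for subtask in decompose(instruction):
        # Each subtask starts from the pose reached by the previous one.
        pose = execute_subtask(subtask, pose)
        log.append((subtask, pose))
    return pose, log


def toy_decompose(instruction: str) -> List[str]:
    """Toy stand-in for the planner: split the instruction on the word 'then'."""
    parts = (p.strip(" ,.") for p in instruction.split("then"))
    return [p for p in parts if p]


def toy_execute(subtask: str, pose: Pose) -> Pose:
    """Toy stand-in for the policy: keyword-driven unit motions."""
    x, y, z, yaw = pose
    if "forward" in subtask:
        x += 1.0
    if "ascend" in subtask:
        z += 1.0
    if "turn left" in subtask:
        yaw = (yaw + 90.0) % 360.0
    return (x, y, z, yaw)


final_pose, trace = run_hierarchical_episode(
    "fly forward, then ascend, then turn left",
    (0.0, 0.0, 0.0, 0.0),
    toy_decompose,
    toy_execute,
)
for subtask, pose in trace:
    print(f"{subtask!r} -> {pose}")
```

Passing the planner and the policy as callables keeps the sequential loop independent of any particular model, which is convenient when comparing a decomposed run against feeding the full instruction to the policy in one step.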