VLN-Pilot: Large Vision-Language Model as an Autonomous Indoor Drone Operator

Bessie Dominguez-Dager, Sergio Suescun-Ferrandiz, Felix Escalona, Francisco Gomez-Donoso, Miguel Cazorla

TL;DR

VLN-Pilot couples a large Vision-Language Model (VLLM) with a rule-based finite-state machine to navigate indoor environments autonomously, using only visual inputs and a topological map. A Unity-based simulator is used to evaluate high-level planning by the VLLM and low-level drone control by the state machine, with the aim of reducing human workload while maintaining safe navigation. A comparative study of GPT and Gemini shows that GPT is practical for grounding and doorway crossing, while revealing prompt-induced fragility and the need for volumetric spatial awareness. Overall, the work points to a scalable, human-friendly paradigm for indoor UAV autonomy, with clear avenues for improving real-world transfer and spatial grounding.

Abstract

This paper introduces VLN-Pilot, a novel framework in which a large Vision-and-Language Model (VLLM) assumes the role of a human pilot for indoor drone navigation. By leveraging the multimodal reasoning abilities of VLLMs, VLN-Pilot interprets free-form natural language instructions and grounds them in visual observations to plan and execute drone trajectories in GPS-denied indoor environments. Unlike traditional rule-based or geometric path-planning approaches, our framework integrates language-driven semantic understanding with visual perception, enabling context-aware, high-level flight behaviors with minimal task-specific engineering. VLN-Pilot supports fully autonomous instruction-following for drones by reasoning about spatial relationships, obstacle avoidance, and dynamic reactivity to unforeseen events. We validate our framework on a custom photorealistic indoor simulation benchmark and demonstrate the ability of the VLLM-driven agent to achieve high success rates on complex instruction-following tasks, including long-horizon navigation with multiple semantic targets. Experimental results highlight the promise of replacing remote drone pilots with a language-guided autonomous agent, opening avenues for scalable, human-friendly control of indoor UAVs in tasks such as inspection, search-and-rescue, and facility monitoring. Our results suggest that VLLM-based pilots may dramatically reduce operator workload while improving safety and mission flexibility in constrained indoor environments.
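
The division of labor described above, where a rule-based state machine handles control and the VLLM decides when to move between states, can be illustrated with a minimal sketch. It is not the authors' implementation: the state names, the yes/no questions, and the `ask_vllm` callable are all hypothetical, and the model is replaced by a scripted stub so the example runs without any API access.

```python
from enum import Enum, auto
from typing import Callable, Iterable


class State(Enum):
    # Hypothetical states; the paper's actual FSM states are not listed here.
    SEARCH = auto()      # look for the target doorway
    APPROACH = auto()    # fly toward the visible doorway
    CROSS_DOOR = auto()  # pass through the doorway
    DONE = auto()


# Hypothetical yes/no question that gates the transition out of each state.
QUESTIONS = {
    State.SEARCH: "Is the door to the {room} visible in the front camera?",
    State.APPROACH: "Is the drone close enough to the door to cross it?",
    State.CROSS_DOOR: "Is the drone now inside the {room}?",
}

# State reached when the (assumed) VLLM answers "yes".
NEXT_ON_YES = {
    State.SEARCH: State.APPROACH,
    State.APPROACH: State.CROSS_DOOR,
    State.CROSS_DOOR: State.DONE,
}


def run_fsm(ask_vllm: Callable[[str], bool], room: str,
            max_steps: int = 50) -> list[State]:
    """Advance the FSM, querying the model once per step; return visited states."""
    state = State.SEARCH
    trace = [state]
    for _ in range(max_steps):
        if state is State.DONE:
            break
        question = QUESTIONS[state].format(room=room)
        if ask_vllm(question):
            state = NEXT_ON_YES[state]
            trace.append(state)
        # On "no" the FSM stays in the current state; a real controller
        # would issue that state's motion command here before asking again.
    return trace


def make_scripted_vllm(answers: Iterable[bool]) -> Callable[[str], bool]:
    """Stand-in for a real model call: replays fixed answers, then says 'no'."""
    it = iter(answers)
    return lambda question: next(it, False)


if __name__ == "__main__":
    stub = make_scripted_vllm([False, True, True, False, True])
    print([s.name for s in run_fsm(stub, "bedroom")])
```

In a real system the stub would be replaced by a call that sends the current camera frame and the question to the model and parses its answer. Keeping the model's role to gated yes/no decisions is one way to bound the effect of an unexpected response, which is the kind of safety property a rule-based controller is meant to provide.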

Paper Structure (15 sections, 9 figures, 4 tables)


Figures (9)

  • Figure 1: System architecture showing the closed-loop pipeline between the Unity drone simulator, the Python controller, and the VLLM.
  • Figure 2: Simulated drone (left) and camera configuration, including front (middle) and rear (right) RGB views. The red, green, and blue arrows correspond to the X, Y, and Z axes, respectively.
  • Figure 3: Layout of the simulated furnished cabin environment, with the living room on the left, the bedroom in the center, and the bathroom on the right.
  • Figure 4: FSM diagram of the drone controller, showing the main execution states (rectangles) and the transition conditions (diamonds) evaluated through queries to the VLLM.
  • Figure 5: Room diagrams of the simulator. The living room is highlighted in green, the bedroom in magenta, and the bathroom in yellow.
  • ...and 4 more figures