SoraNav: Adaptive UAV Task-Centric Navigation via Zeroshot VLM Reasoning

Hongyu Song, Rishabh Dev Yadav, Cheng Guo, Wei Pan

TL;DR

SoraNav addresses instruction-driven UAV navigation in unknown, compact 3D environments by combining zero-shot VLM reasoning with geometry-aware decision-making. It introduces Multi-modal Visual Annotation (MVA), depth-aligned sensing, and multi-layer occupancy maps to ground semantic reasoning in real geometry, and it uses a roadmap hypergraph with Adaptive Decision Making (ADM) to switch between VLM-driven and geometry-driven exploration. Relative to the baseline, the approach improves Success Rate (SR) by 25.7% and Success weighted by Path Length (SPL) by 17% in 2.5D settings, and SR by 29.5% and SPL by 18.5% in 3D settings. It also transfers across multiple large VLMs and has been deployed on a physical UAV. The work provides a PX4-based hardware-software platform with a digital twin for reproducible evaluation, and it outlines motion-aware and semantically enriched extensions as future work.
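The switch between VLM-driven and geometry-driven exploration can be illustrated with a minimal sketch. This is not the paper's ADM implementation: the roadmap hypergraph is reduced to plain node identifiers, and the names `NavigationHistory`, `choose_next_node`, and the "uninformative" test (the VLM proposes nothing, a visited node, or a known dead end) are assumptions made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Hashable, Iterable, Optional, Set, Tuple


@dataclass
class NavigationHistory:
    """Hypothetical record of where the UAV has already been."""
    visited: Set[Hashable] = field(default_factory=set)
    dead_ends: Set[Hashable] = field(default_factory=set)

    def is_stale(self, node: Hashable) -> bool:
        # A node is stale if going there would be a revisit or a known dead end.
        return node in self.visited or node in self.dead_ends


def choose_next_node(
    vlm_choice: Optional[Hashable],
    frontier_candidates: Iterable[Hashable],
    history: NavigationHistory,
) -> Tuple[Optional[Hashable], str]:
    """Return (next_node, mode), where mode is 'vlm', 'geometry', or 'exhausted'.

    Assumed rule: trust the VLM unless its choice is missing or stale,
    otherwise fall back to the first unexplored geometric candidate.
    """
    if vlm_choice is not None and not history.is_stale(vlm_choice):
        return vlm_choice, "vlm"

    # Geometry-driven fallback: take the first candidate not yet explored.
    for node in frontier_candidates:
        if not history.is_stale(node):
            return node, "geometry"

    # Nothing left to explore from the current roadmap.
    return None, "exhausted"
```

For example, with `visited={"a"}`, a VLM choice of `"b"` is accepted in `"vlm"` mode, while a VLM choice of `"a"` with candidates `["a", "c"]` falls back to `"c"` in `"geometry"` mode. A real system would rank the geometric candidates (for instance by information gain or distance) instead of taking the first one.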

Abstract

Interpreting visual observations and natural language instructions for complex task execution remains a key challenge in robotics and AI. Despite recent advances, language-driven navigation is still difficult, particularly for UAVs in small-scale 3D environments. Existing Vision-Language Navigation (VLN) approaches are mostly designed for ground robots and struggle to generalize to aerial tasks that require full 3D spatial reasoning. The emergence of large Vision-Language Models (VLMs), such as GPT and Claude, enables zero-shot semantic reasoning from visual and textual inputs. However, these models lack spatial grounding and are not directly applicable to navigation. To address these limitations, we introduce SoraNav, an adaptive UAV navigation framework that integrates zero-shot VLM reasoning with geometry-aware decision-making. Geometric priors are incorporated into image annotations to constrain the VLM action space and improve decision quality. A hybrid switching strategy leverages navigation history to alternate between VLM reasoning and geometry-based exploration, mitigating dead ends and redundant revisits. A PX4-based hardware-software platform, comprising both a digital twin and a physical micro-UAV, enables reproducible evaluation. Experimental results show that, relative to the baseline, our method improves Success Rate (SR) by 25.7% and Success weighted by Path Length (SPL) by 17% in 2.5D scenarios, and improves SR by 29.5% and SPL by 18.5% in 3D scenarios.
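The two reported metrics can be computed as in the sketch below. It assumes the paper uses the standard definitions from the embodied-navigation literature: SR is the fraction of successful episodes, and SPL averages, over all episodes, the success indicator times the ratio of shortest path length to the larger of actual and shortest path length. The `Episode` type and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """Hypothetical per-episode record; lengths are in metres."""
    success: bool
    shortest_path_m: float  # length of the optimal start-to-goal path
    actual_path_m: float    # length of the path the UAV actually flew


def success_rate(episodes: Sequence[Episode]) -> float:
    """SR: fraction of episodes that reached the goal."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(1 for e in episodes if e.success) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """SPL: success weighted by (shortest / max(actual, shortest)) path length."""
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for e in episodes:
        if e.shortest_path_m <= 0:
            raise ValueError("shortest path length must be positive")
        if e.success:
            # Failed episodes contribute 0; successful ones are penalised
            # in proportion to how much longer than optimal the path was.
            total += e.shortest_path_m / max(e.actual_path_m, e.shortest_path_m)
    return total / len(episodes)
```

With three episodes (one optimal success, one success that flew twice the shortest path, and one failure), this gives SR = 2/3 and SPL = (1 + 0.5 + 0) / 3 = 0.5, which shows how SPL penalizes inefficient paths that SR alone would count as full successes.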

Paper Structure

This paper contains 29 sections, 17 equations, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: Illustration of the System Overview and Data Flow.
  • Figure 2: Illustration of Anchors and Visual Annotations.
  • Figure 3: Pipeline of Adaptive Decision Making. The pipeline illustrates how multi-modal prompting is employed to guide the Large VLM in reasoning and decision generation. Example reasoning outputs from the VLM are shown for context. A roadmap hypergraph is then used to validate the effectiveness of VLM decisions, enabling a transition from uninformative VLM decisions to geometric strategies.
  • Figure 4: Custom UAV platform designed for ZSVTN.
  • Figure 5: Real-World (Left) and Simulated (Right) Flight Scenes.
  • ...and 1 more figure