GeoNav: Empowering MLLMs with Explicit Geospatial Reasoning Abilities for Language-Goal Aerial Navigation

Haotian Xu, Yue Hu, Chen Gao, Zhengqiu Zhu, Yong Zhao, Yong Li, Quanjun Yin

TL;DR

This work tackles language-guided aerial navigation in urban environments, where long-range planning and target disambiguation are difficult under partial observability. It introduces GeoNav, a three-stage framework that combines a schematic cognitive map (SCM) for global navigation with a hierarchical scene graph (HSG) for local target localization, both driven by stage-aware multimodal chain-of-thought prompting. The approach uses prior geographic knowledge and embodied perception for cross-scale reasoning, achieving state-of-the-art results on the CityNav benchmark with substantial gains in success rate (SR) and oracle success rate (OSR) and improved path efficiency. Ablation analyses and qualitative cases underscore the importance of structured spatial memory and staged reasoning for robust UAV navigation in real-world urban settings.

Abstract

Language-goal aerial navigation is a critical challenge in embodied AI, requiring UAVs to localize targets in complex environments such as urban blocks based on textual specification. Existing methods, often adapted from indoor navigation, struggle to scale due to limited field of view, semantic ambiguity among objects, and lack of structured spatial reasoning. In this work, we propose GeoNav, a geospatially aware multimodal agent to enable long-range navigation. GeoNav operates in three phases (landmark navigation, target search, and precise localization), mimicking human coarse-to-fine spatial strategies. To support such reasoning, it dynamically builds two different types of spatial memory. The first is a global but schematic cognitive map, which fuses prior textual geographic knowledge and embodied visual cues into a top-down, annotated form for fast navigation to the landmark region. The second is a local but delicate scene graph representing hierarchical spatial relationships between blocks, landmarks, and objects, which is used for definite target localization. On top of this structured representation, GeoNav employs a spatially aware, multimodal chain-of-thought prompting mechanism to equip multimodal large language models with efficient and interpretable decision-making across stages. On the CityNav urban navigation benchmark, GeoNav surpasses the current state of the art by up to 12.53% in success rate and significantly improves navigation efficiency, even in hard-level tasks. Ablation studies highlight the importance of each module, showcasing how geospatial representations and coarse-to-fine reasoning enhance UAV navigation.
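The abstract names two ideas that are easy to picture as data structures: a scene graph with a block, landmark, object hierarchy, and a controller that moves through three phases. The Python below is a minimal sketch of both, not the authors' implementation. The names `Stage`, `Node`, `next_stage`, and the two boolean transition signals are hypothetical, and the abstract does not state how GeoNav actually decides when to switch phases.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    """The three phases named in the abstract."""
    LANDMARK_NAVIGATION = 1
    TARGET_SEARCH = 2
    PRECISE_LOCALIZATION = 3


@dataclass
class Node:
    """One node of a hierarchical scene graph (hypothetical layout)."""
    name: str
    level: str  # "block", "landmark", or "object"
    children: list["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        """Attach a child node and return it so calls can be chained."""
        self.children.append(child)
        return child

    def find(self, name: str, path: tuple = ()):
        """Depth-first search; return the names from this node to the match, or None."""
        path = path + (self.name,)
        if self.name == name:
            return path
        for child in self.children:
            found = child.find(name, path)
            if found is not None:
                return found
        return None


def next_stage(stage: Stage, landmark_reached: bool, candidate_seen: bool) -> Stage:
    """Assumed transition rule: advance one phase when its signal fires."""
    if stage is Stage.LANDMARK_NAVIGATION and landmark_reached:
        return Stage.TARGET_SEARCH
    if stage is Stage.TARGET_SEARCH and candidate_seen:
        return Stage.PRECISE_LOCALIZATION
    return stage


# Example scene: one block containing one landmark containing one object.
block = Node("block_A", "block")
landmark = block.add(Node("city_hall", "landmark"))
landmark.add(Node("red_car", "object"))
```

Calling `block.find("red_car")` returns the containment path from block to object, which is the kind of hierarchical relation the abstract says supports target localization. A real system would also store geometry and relations between sibling nodes, which this sketch omits.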

Paper Structure

This paper contains 44 sections, 18 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Geographic and embodied data integration in a schematic cognitive map.
  • Figure 3: Navigational progress as the stages
  • Figure 4: Qualitative example of our GeoNav agent performing the language-goal aerial navigation task on the CityNav benchmark.
  • Figure 5: The multi-modal reasoning of MLLM in GeoNav
  • Figure : (a) Building the hierarchical scene graph
  • ...and 2 more figures