GeoNav: Empowering MLLMs with Explicit Geospatial Reasoning Abilities for Language-Goal Aerial Navigation
Haotian Xu, Yue Hu, Chen Gao, Zhengqiu Zhu, Yong Zhao, Yong Li, Quanjun Yin
TL;DR
This work tackles language-guided aerial navigation in urban environments, where long-range planning and target disambiguation are challenging under partial observability. It introduces GeoNav, a three-stage framework that combines a schematic cognitive map (SCM) for global navigation with a hierarchical scene graph (HSG) for local target localization, both driven by stage-aware multimodal chain-of-thought prompting. The approach uses prior geographic knowledge and embodied perception to reason across spatial scales, achieving state-of-the-art results on the CityNav benchmark with substantial gains in success rate (SR) and oracle success rate (OSR) and improved path efficiency. Ablation analyses and qualitative cases underscore the importance of structured spatial memory and staged reasoning for robust UAV navigation in real-world urban settings.
Abstract
Language-goal aerial navigation is a critical challenge in embodied AI, requiring UAVs to localize targets in complex environments such as urban blocks from a textual specification. Existing methods, often adapted from indoor navigation, struggle to scale due to limited field of view, semantic ambiguity among objects, and a lack of structured spatial reasoning. In this work, we propose GeoNav, a geospatially aware multimodal agent that enables long-range navigation. GeoNav operates in three phases (landmark navigation, target search, and precise localization), mimicking human coarse-to-fine spatial strategies. To support such reasoning, it dynamically builds two types of spatial memory. The first is a global but schematic cognitive map, which fuses prior textual geographic knowledge and embodied visual cues into a top-down, annotated form for fast navigation to the landmark region. The second is a local but fine-grained scene graph representing hierarchical spatial relationships between blocks, landmarks, and objects, which is used for unambiguous target localization. On top of this structured representation, GeoNav employs a spatially aware, multimodal chain-of-thought prompting mechanism that equips multimodal large language models (MLLMs) with efficient and interpretable decision-making across stages. On the CityNav urban navigation benchmark, GeoNav surpasses the current state of the art by up to 12.53% in success rate and significantly improves navigation efficiency, even on hard-level tasks. Ablation studies highlight the importance of each module, showing how geospatial representations and coarse-to-fine reasoning enhance UAV navigation.
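As a rough illustration of the structure described above, the following minimal Python sketch models the three phases as an enumeration and the block, landmark, and object hierarchy as a small tree. It is not the authors' implementation: the names `Stage`, `Node`, and `next_stage`, the example place names, and the transition rule (advance only when the current phase's goal is met) are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Stage(Enum):
    """The three phases named in the abstract (identifiers are hypothetical)."""
    LANDMARK_NAVIGATION = 1
    TARGET_SEARCH = 2
    PRECISE_LOCALIZATION = 3


@dataclass
class Node:
    """One node of a hierarchical scene graph: block -> landmark -> object."""
    name: str
    level: str  # "block", "landmark", or "object"
    children: list[Node] = field(default_factory=list)

    def add(self, child: Node) -> Node:
        """Attach a child node and return it so calls can be chained."""
        self.children.append(child)
        return child

    def find(self, name: str) -> Optional[list[str]]:
        """Return the path of names from this node down to `name`, or None."""
        if self.name == name:
            return [self.name]
        for child in self.children:
            sub_path = child.find(name)
            if sub_path is not None:
                return [self.name] + sub_path
        return None


def next_stage(stage: Stage, landmark_reached: bool, target_seen: bool) -> Stage:
    """Assumed transition rule: move on only when the current goal is met."""
    if stage is Stage.LANDMARK_NAVIGATION and landmark_reached:
        return Stage.TARGET_SEARCH
    if stage is Stage.TARGET_SEARCH and target_seen:
        return Stage.PRECISE_LOCALIZATION
    return stage


# Hypothetical example: one block containing one landmark and two objects.
block = Node("block_A", "block")
plaza = block.add(Node("central_plaza", "landmark"))
plaza.add(Node("red_car", "object"))
plaza.add(Node("blue_van", "object"))
```

In this sketch, `block.find("red_car")` returns the containment path from block to object, which is the kind of hierarchical context a prompt for the localization phase could draw on; how GeoNav actually serializes its scene graph into prompts is not specified in this section.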
