Exploring Spatial Representation to Enhance LLM Reasoning in Aerial Vision-Language Navigation

Yunpeng Gao; Zhigang Wang; Pengfei Han; Linglin Jing; Dong Wang; Bin Zhao

TL;DR

This work tackles aerial Vision-and-Language Navigation (VLN) with a zero-shot framework in which an LLM predicts UAV actions. A novel Semantic-Topo-Metric Representation (STMR) projects instruction-relevant semantic information onto a growing top-down map and converts it into a 20×20 grid-based matrix prompt that supports spatial reasoning. In experiments in both simulation and real outdoor settings, the approach improves absolute success rate over state-of-the-art baselines by 26.8% on simple navigation tasks and 5.8% on complex ones, demonstrating zero-shot capability and practical UAV applicability. The method emphasizes interpretability and robustness by incorporating sub-goal tracking, spatial hashing, and a modular LLM planner; the dataset and code are to be released.

Abstract

Aerial Vision-and-Language Navigation (VLN) is a novel task enabling Unmanned Aerial Vehicles (UAVs) to navigate in outdoor environments through natural language instructions and visual cues. However, it remains challenging due to the complex spatial relationships in aerial scenes. In this paper, we propose a training-free, zero-shot framework for aerial VLN tasks, in which a large language model (LLM) serves as the agent for action prediction. Specifically, we develop a novel Semantic-Topo-Metric Representation (STMR) to enhance the spatial reasoning capabilities of LLMs. This is achieved by extracting instruction-related semantic masks and projecting them onto a top-down map, which presents spatial and topological information about surrounding landmarks and grows during the navigation process. At each step, a local map centered at the UAV is extracted from the growing top-down map and transformed into a matrix representation with distance metrics, which serves as the text prompt to the LLM for action prediction in response to the given instruction. Experiments conducted in real and simulation environments demonstrate the effectiveness and robustness of our method, which achieves absolute success rate improvements of 26.8% and 5.8% over current state-of-the-art methods on simple and complex navigation tasks, respectively. The dataset and code will be released soon.
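
The per-step conversion from top-down map to text prompt can be illustrated with a minimal sketch. It is not the authors' implementation: the integer class IDs, the majority-vote pooling, the cell size in meters, the prompt wording, and all function names (`extract_local_map`, `pool_to_grid`, `matrix_prompt`) are assumptions made for illustration. Only the overall flow (crop a local map centered at the UAV, reduce it to a matrix, attach a distance scale, emit text) follows the description above.

```python
from collections import Counter


def extract_local_map(global_map, center, size, fill=0):
    """Crop a size x size window centered on (row, col).

    Cells that fall outside the global map are padded with `fill`
    (hypothetical "unexplored" class).
    """
    r0 = center[0] - size // 2
    c0 = center[1] - size // 2
    rows, cols = len(global_map), len(global_map[0])
    return [
        [
            global_map[r][c] if 0 <= r < rows and 0 <= c < cols else fill
            for c in range(c0, c0 + size)
        ]
        for r in range(r0, r0 + size)
    ]


def pool_to_grid(local_map, grid=20):
    """Reduce the local map to grid x grid cells by majority vote per block.

    Assumes len(local_map) is a multiple of `grid`.
    """
    block = len(local_map) // grid
    pooled = []
    for gr in range(grid):
        row = []
        for gc in range(grid):
            votes = Counter(
                local_map[r][c]
                for r in range(gr * block, (gr + 1) * block)
                for c in range(gc * block, (gc + 1) * block)
            )
            row.append(votes.most_common(1)[0][0])
        pooled.append(row)
    return pooled


def matrix_prompt(grid_map, legend, cell_m):
    """Render the grid as text, with a legend and a metric scale per cell."""
    lines = [f"Each cell covers {cell_m} m x {cell_m} m. The UAV is at the center."]
    lines.append("Legend: " + ", ".join(f"{k}={v}" for k, v in sorted(legend.items())))
    lines += [" ".join(str(v) for v in row) for row in grid_map]
    return "\n".join(lines)


# Toy top-down map: 100 x 100 cells, one "building" (class 1) on empty ground (class 0).
global_map = [[0] * 100 for _ in range(100)]
for r in range(40, 50):
    for c in range(60, 70):
        global_map[r][c] = 1

local_map = extract_local_map(global_map, center=(50, 50), size=40)
grid_map = pool_to_grid(local_map, grid=20)
prompt = matrix_prompt(grid_map, legend={0: "unexplored/ground", 1: "building"}, cell_m=5)
print(prompt)
```

In this sketch the legend and the per-cell scale are what let a text-only model relate matrix offsets to landmarks and distances; a real system would also need to encode the UAV heading and the current sub-goal, which are omitted here.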
