Exploring Spatial Representation to Enhance LLM Reasoning in Aerial Vision-Language Navigation

Yunpeng Gao, Zhigang Wang, Pengfei Han, Linglin Jing, Dong Wang, Bin Zhao

TL;DR

This work tackles aerial Vision-and-Language Navigation (VLN) by introducing a zero-shot framework that uses an LLM to predict UAV actions. A novel Semantic-Topo-Metric Representation (STMR) converts instruction-relevant semantic information into a growing top-down map and a 20×20 grid-based matrix prompt, enabling robust spatial reasoning. The approach yields substantial improvements over state-of-the-art baselines on simple and complex navigation tasks in both simulation and real outdoor settings, demonstrating strong zero-shot capabilities and practical UAV applicability. The method emphasizes interpretability and robustness by incorporating sub-goal tracking, spatial hashing, and a modular LLM planner, with datasets and code to be released.

Abstract

Aerial Vision-and-Language Navigation (VLN) is a novel task enabling Unmanned Aerial Vehicles (UAVs) to navigate in outdoor environments through natural language instructions and visual cues. However, it remains challenging due to the complex spatial relationships in aerial scenes. In this paper, we propose a training-free, zero-shot framework for aerial VLN tasks, where the large language model (LLM) is leveraged as the agent for action prediction. Specifically, we develop a novel Semantic-Topo-Metric Representation (STMR) to enhance the spatial reasoning capabilities of LLMs. This is achieved by extracting and projecting instruction-related semantic masks onto a top-down map, which presents spatial and topological information about surrounding landmarks and grows during the navigation process. At each step, a local map centered at the UAV is extracted from the growing top-down map and transformed into a matrix representation with distance metrics, which serves as the text prompt to the LLM for action prediction in response to the given instruction. Experiments conducted in real and simulated environments demonstrate the effectiveness and robustness of our method, achieving absolute success rate improvements of 26.8% and 5.8% over current state-of-the-art methods on simple and complex navigation tasks, respectively. The dataset and code will be released soon.
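
The per-step conversion described in the abstract (crop a local map around the UAV, then serialize it as a matrix for the text prompt) might look roughly like the sketch below. This is an illustration under stated assumptions, not the authors' implementation: the sparse dictionary map, the integer label ids, the legend, the 5 m cell size, and the function names extract_local_matrix and matrix_to_prompt are all hypothetical. Only the 20×20 window size comes from the source.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]  # (row, col) index in the global top-down grid


def extract_local_matrix(top_down: Dict[Cell, int], uav: Cell,
                         size: int = 20) -> List[List[int]]:
    """Crop a size x size window around the UAV cell.

    Cells that were never observed are filled with 0 (assumed "unknown" id).
    With an even size, the UAV lands at index (size // 2, size // 2).
    """
    half = size // 2
    r0, c0 = uav[0] - half, uav[1] - half
    return [[top_down.get((r0 + i, c0 + j), 0) for j in range(size)]
            for i in range(size)]


def matrix_to_prompt(matrix: List[List[int]], legend: Dict[int, str],
                     cell_m: float) -> str:
    """Serialize the matrix as text: a scale line, a legend line, then rows."""
    size = len(matrix)
    header = (f"Each cell covers {cell_m} m x {cell_m} m. "
              f"The UAV is at row {size // 2}, column {size // 2}.")
    legend_line = "Legend: " + ", ".join(
        f"{k}={v}" for k, v in sorted(legend.items()))
    rows = [" ".join(str(v) for v in row) for row in matrix]
    return "\n".join([header, legend_line, *rows])


# Hypothetical map: label ids and positions are made up for illustration.
top_down = {(50, 50): 1, (50, 53): 2, (100, 100): 3}
legend = {0: "unknown", 1: "road", 2: "building", 3: "park"}
local = extract_local_matrix(top_down, uav=(50, 50))
prompt = matrix_to_prompt(local, legend, cell_m=5.0)
```

Stating the cell size in the prompt is one simple way to give the matrix a metric meaning: a landmark three columns to the right of the UAV is then about 15 m away under the assumed 5 m scale.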

Paper Structure

This paper contains 30 sections, 4 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: The pipeline to obtain STMR. (a) The observed RGB image, the corresponding segmented image, and the depth image. (b) Segmented images are projected into the top-down map gradually during the UAV flight, which captures the semantic and topological information of the environment. (c) The top-down map is further transformed into a 20×20 matrix representation with distance metrics for LLM reasoning.
  • Figure 2: Our method consists of three modules, i.e., Sub-Goal Extraction, Semantic-Topo-Metric Representation, and LLM planner. They are utilized to generate sub-goal instructions, spatial information representations, and UAV navigation actions, respectively.
  • Figure 3: 2D Visual Perceptor for the UAV.
  • Figure 4: Demonstration of the STMR in spatial reasoning.
  • Figure 5: Failure cases of our method.
  • ...and 2 more figures
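
Step (b) of Figure 1, where segmented pixels are projected into a top-down map that grows during flight, can be sketched as follows. The sketch makes strong simplifying assumptions that are not taken from the paper: a pinhole camera pointing straight down with no rotation, known intrinsics (fx, fy, cx, cy), metric depth per pixel, label 0 meaning "no instruction-relevant landmark", and a sparse dictionary as the map. The function names are hypothetical.

```python
import math
from typing import Dict, Iterable, Tuple

Cell = Tuple[int, int]  # (row, col) index in the global top-down grid


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float):
    """Standard pinhole back-projection of a pixel to camera-frame (X, Y, Z)."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def update_top_down(top_down: Dict[Cell, int],
                    pixels: Iterable[Tuple[float, float, float, int]],
                    uav_xy: Tuple[float, float],
                    intrinsics: Tuple[float, float, float, float],
                    cell_m: float) -> Dict[Cell, int]:
    """Write labeled pixels (u, v, depth, label) into the sparse map in place.

    Assumes a downward-facing, unrotated camera, so camera X/Y offsets are
    simply added to the UAV's ground position.
    """
    fx, fy, cx, cy = intrinsics
    for u, v, depth, label in pixels:
        if label == 0 or depth <= 0:
            continue  # skip background pixels and invalid depth readings
        x_cam, y_cam, _ = backproject(u, v, depth, fx, fy, cx, cy)
        wx, wy = uav_xy[0] + x_cam, uav_xy[1] + y_cam
        # Bin metric coordinates into grid cells; floor handles negatives.
        cell = (math.floor(wy / cell_m), math.floor(wx / cell_m))
        top_down[cell] = label
    return top_down


# Two hypothetical observations from different UAV positions.
K = (100.0, 100.0, 50.0, 50.0)
grid: Dict[Cell, int] = {}
update_top_down(grid, [(50, 50, 30.0, 2), (150, 50, 30.0, 3), (10, 10, 30.0, 0)],
                uav_xy=(12.0, 7.0), intrinsics=K, cell_m=5.0)
update_top_down(grid, [(50, 50, 30.0, 4)],
                uav_xy=(-3.0, -3.0), intrinsics=K, cell_m=5.0)
```

Keying the map by cell index means it needs no preallocated bounds, so it can extend in any direction as the UAV explores. A real system would also need the full camera pose (rotation and altitude) rather than the nadir-camera shortcut used here.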