Hierarchical Language Models for Semantic Navigation and Manipulation in an Aerial-Ground Robotic System

Haokun Liu, Zhaoqi Ma, Yunong Li, Junichiro Sugihara, Yicheng Chen, Jinjie Li, Moju Zhao

TL;DR

This work tackles the challenge of generalizable, robust coordination in heterogeneous aerial–ground robot teams. It proposes a three-layer hierarchical MA-LLM framework that ties high-level reasoning and global semantic mapping (via an LLM) to precise perception (via a GridMask-tuned VLM) and a pre-programmed execution layer that governs motion functions, enabling cooperative navigation and manipulation. A key contribution is GridMask-based fine-tuning, which yields up to a 78% reduction in localization error and supports reliable fine-grained manipulation on bird-view imagery, facilitating long-horizon planning and coordination. Extensive simulation and real-world experiments demonstrate zero-shot generalization, robust semantic navigation, and reliable manipulation, highlighting the framework’s potential for scalable, adaptable robot collaboration in dynamic environments; however, latency and perception in clutter remain areas for improvement.

Abstract

Heterogeneous multirobot systems show great potential in complex tasks requiring coordinated hybrid cooperation. However, existing methods that rely on static or task-specific models often lack generalizability across diverse tasks and dynamic environments. This highlights the need for generalizable intelligence that can bridge high-level reasoning with low-level execution across heterogeneous agents. To address this, we propose a hierarchical multimodal framework that integrates a prompted large language model (LLM) with a fine-tuned vision-language model (VLM). At the system level, the LLM performs hierarchical task decomposition and constructs a global semantic map, while the VLM provides semantic perception and object localization, where the proposed GridMask significantly enhances the VLM's spatial accuracy for reliable fine-grained manipulation. The aerial robot leverages this global map to generate semantic paths and guide the ground robot's local navigation and manipulation, ensuring robust coordination even in target-absent or ambiguous scenarios. We validate the framework through extensive simulation and real-world experiments on long-horizon object arrangement tasks, demonstrating zero-shot adaptability, robust semantic navigation, and reliable manipulation in dynamic environments. To the best of our knowledge, this work presents the first heterogeneous aerial-ground robotic system that integrates VLM-based perception with LLM-driven reasoning for global high-level task planning and execution.

Paper Structure

This paper contains 33 sections, 10 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Overview of the proposed hierarchical language model framework integrated into an aerial-ground robotic system. In the sub-task "move to (XX, XX)", the aerial robot follows an optimized global path while continuously capturing bird-view images. These images are processed into semantic information that guides the ground robot’s real-time local navigation, implicitly allowing the ground robot to follow the aerial robot’s position.
  • Figure 2: An overview of the workflow for a long-horizon task using the hierarchical MA-LLM framework. The task starts with the instruction: "Assemble the word OK, but do not move K." 1) The LLM decomposes the command and maps sub-tasks to motion functions for the aerial and ground robots. 2) The aerial robot visits multiple viewpoints to collect local maps, which the LLM integrates into a global semantic map. 3) Once the map is ready, both robots coordinate to reach the "O" cube. 4) The aerial robot follows a task-specific global path, while the VLM processes GridMask-enhanced bird-view images. 5) The ground robot uses this semantic input from the VLM to complete its assigned sub-task via local planning.
  • Figure 3: An illustration of the fine-tuning dataset, showing the system prompt, user instruction with GridMask-based bird-view image input, and the ideal output in structured JSON format.
  • Figure 4: Workflow of the quad_construct_map() function. Local semantic maps derived from aerial images are integrated by the LLM to form a global semantic map, which is subsequently updated at 15 s intervals during task execution.
  • Figure 5: Performance comparison of different models based on objects' average Euclidean deviation from ground-truth positions in the image (in grid units).
  • ...and 5 more figures
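
The metric named in the Figure 5 caption, the average Euclidean deviation of objects from their ground-truth image positions expressed in grid units, can be illustrated with a minimal sketch. The function name `mean_grid_deviation`, the pixel-coordinate input format, and the `cell_size_px` parameter are assumptions made for illustration; the source does not specify how the authors compute or normalize the metric.

```python
import math


def mean_grid_deviation(predicted, ground_truth, cell_size_px):
    """Average Euclidean distance between predicted and true positions.

    Positions are (x, y) pixel coordinates keyed by object name. The result is
    divided by the grid cell size so that it is expressed in grid units.
    All names and the normalization are hypothetical, not taken from the paper.
    """
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    if predicted.keys() != ground_truth.keys():
        raise ValueError("predicted and ground_truth must cover the same objects")
    if cell_size_px <= 0:
        raise ValueError("cell_size_px must be positive")

    total = 0.0
    for name, (gx, gy) in ground_truth.items():
        px, py = predicted[name]
        # Per-object Euclidean error in pixels, converted to grid units.
        total += math.hypot(px - gx, py - gy) / cell_size_px
    return total / len(ground_truth)


# Illustrative values only; these are not results from the paper.
predicted = {"O": (130.0, 240.0), "K": (400.0, 90.0)}
ground_truth = {"O": (100.0, 200.0), "K": (400.0, 100.0)}
deviation = mean_grid_deviation(predicted, ground_truth, cell_size_px=50.0)
print(f"mean deviation: {deviation:.2f} grid units")
```

In this toy input the "O" cube is off by one full cell and the "K" cube by a fifth of a cell, so the average comes to 0.6 grid units. Reporting the error in grid units rather than pixels keeps the number comparable across image resolutions, provided the grid layout is held fixed.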