Hierarchical Language Models for Semantic Navigation and Manipulation in an Aerial-Ground Robotic System

Haokun Liu; Zhaoqi Ma; Yunong Li; Junichiro Sugihara; Yicheng Chen; Jinjie Li; Moju Zhao

TL;DR

This work addresses generalizable, robust coordination in heterogeneous aerial-ground robot teams. It proposes a three-layer hierarchical MA-LLM framework: an LLM handles high-level reasoning and global semantic mapping, a GridMask-tuned VLM handles precise perception, and a pre-programmed execution layer governs the motion functions used for cooperative navigation and manipulation. A key contribution is GridMask-based fine-tuning, which reduces localization error by up to 78% and supports reliable fine-grained manipulation on bird's-eye-view imagery, which in turn facilitates long-horizon planning and coordination. Extensive simulation and real-world experiments demonstrate zero-shot generalization, robust semantic navigation, and reliable manipulation, highlighting the framework’s potential for scalable, adaptable robot collaboration in dynamic environments. Latency and perception in clutter remain areas for improvement.

Abstract

Heterogeneous multirobot systems show great potential in complex tasks requiring coordinated hybrid cooperation. However, existing methods that rely on static or task-specific models often lack generalizability across diverse tasks and dynamic environments. This highlights the need for generalizable intelligence that can bridge high-level reasoning with low-level execution across heterogeneous agents. To address this, we propose a hierarchical multimodal framework that integrates a prompted large language model (LLM) with a fine-tuned vision-language model (VLM). At the system level, the LLM performs hierarchical task decomposition and constructs a global semantic map, while the VLM provides semantic perception and object localization, where the proposed GridMask significantly enhances the VLM's spatial accuracy for reliable fine-grained manipulation. The aerial robot leverages this global map to generate semantic paths and guide the ground robot's local navigation and manipulation, ensuring robust coordination even in target-absent or ambiguous scenarios. We validate the framework through extensive simulation and real-world experiments on long-horizon object arrangement tasks, demonstrating zero-shot adaptability, robust semantic navigation, and reliable manipulation in dynamic environments. To the best of our knowledge, this work presents the first heterogeneous aerial-ground robotic system that integrates VLM-based perception with LLM-driven reasoning for global high-level task planning and execution.
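The layered data flow described above (reasoning, then perception, then pre-programmed execution) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the LLM and VLM are replaced by hard-coded stubs, and every name here (`Subtask`, `reasoning_layer`, `perception_layer`, `MOTION_FUNCTIONS`, `run`) is a hypothetical placeholder chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Position = Tuple[float, float]


@dataclass
class Subtask:
    action: str  # name of a pre-programmed motion function (hypothetical)
    target: str  # semantic label of the object or place to act on


def reasoning_layer(instruction: str) -> List[Subtask]:
    """Stand-in for the prompted LLM: returns one fixed decomposition.

    A real system would query a language model with the instruction;
    this stub ignores the text and hard-codes a single plan.
    """
    return [
        Subtask("navigate_to", "red block"),
        Subtask("grasp", "red block"),
        Subtask("navigate_to", "bin"),
        Subtask("release", "bin"),
    ]


def perception_layer(target: str, scene: Dict[str, Position]) -> Position:
    """Stand-in for the fine-tuned VLM: looks the target up in a toy scene."""
    return scene[target]


# Execution layer: a fixed table of motion functions (assumed names).
MOTION_FUNCTIONS: Dict[str, Callable[[Position], str]] = {
    "navigate_to": lambda xy: f"drive to ({xy[0]:.1f}, {xy[1]:.1f})",
    "grasp": lambda xy: f"grasp at ({xy[0]:.1f}, {xy[1]:.1f})",
    "release": lambda xy: f"release at ({xy[0]:.1f}, {xy[1]:.1f})",
}


def run(instruction: str, scene: Dict[str, Position]) -> List[str]:
    """Pass each subtask through perception, then dispatch to execution."""
    log = []
    for subtask in reasoning_layer(instruction):
        xy = perception_layer(subtask.target, scene)
        log.append(MOTION_FUNCTIONS[subtask.action](xy))
    return log


if __name__ == "__main__":
    toy_scene = {"red block": (1.0, 2.0), "bin": (3.0, 0.5)}
    for step in run("put the red block in the bin", toy_scene):
        print(step)
```

The point of the sketch is the separation of concerns: the planner only emits action names and semantic targets, the perception step alone turns a target into coordinates, and the execution table is the only place that knows how to move.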
