AERMANI-VLM: Structured Prompting and Reasoning for Aerial Manipulation with Vision Language Models

Sarthak Mishra, Rishabh Dev Yadav, Avirup Das, Saksham Gupta, Wei Pan, Spandan Roy

TL;DR

This work tackles the safety and reliability gap in using vision–language models for aerial manipulation by decoupling high-level reasoning from low-level control. It introduces AERMANI-VLM, a structured prompting framework that uses a Descriptive Reasoning Trace (DRT) and a flight-safe skill library to convert natural-language commands into deterministic, auditable actions within a POMDP setting. The approach demonstrates robust zero-shot manipulation in simulation and hardware, outperforming several baselines and ablations by constraining reasoning and ensuring flight-safe execution. The results highlight that structured input and output are critical for grounding perception, maintaining temporal coherence, and preventing hallucinations in aerial tasks. This framework paves the way for safe, interpretable language-guided autonomy in aerial robots without task-specific retraining.

Abstract

The rapid progress of vision–language models (VLMs) has sparked growing interest in robotic control, where natural language can express the operation goals while visual feedback links perception to action. However, directly deploying VLM-driven policies on aerial manipulators remains unsafe and unreliable since the generated actions are often inconsistent, hallucination-prone, and dynamically infeasible for flight. In this work, we present AERMANI-VLM, the first framework to adapt pretrained VLMs for aerial manipulation by separating high-level reasoning from low-level control, without any task-specific fine-tuning. Our framework encodes natural language instructions, task context, and safety constraints into a structured prompt that guides the model to generate a step-by-step reasoning trace in natural language. This reasoning output is used to select from a predefined library of discrete, flight-safe skills, ensuring interpretable and temporally consistent execution. By decoupling symbolic reasoning from physical action, AERMANI-VLM mitigates hallucinated commands and prevents unsafe behavior, enabling robust task completion. We validate the framework in both simulation and hardware on diverse multi-step pick-and-place tasks, demonstrating strong generalization to previously unseen commands, objects, and environments.
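
The prompt-then-select step described in the abstract can be illustrated with a minimal Python sketch. It assumes the VLM reply is a JSON object with `task_summary`, `reasoning`, and `skill` fields. That reply format, the skill names, the `hover` fallback, and the helper names `compile_prompt` and `select_skill` are hypothetical choices made for illustration and are not taken from the paper. The sketch only shows the general idea that a reply naming anything outside the predefined library never reaches the controller.

```python
import json
from dataclasses import dataclass, field

# Hypothetical skill names; the paper's actual library is not listed here.
SKILL_LIBRARY = {"search", "approach", "grasp", "place", "hover"}
SAFE_FALLBACK = "hover"  # assumed safe default when the reply is unusable


@dataclass
class PromptState:
    """Inputs that get compiled into the structured prompt."""
    command: str
    safety_rules: list[str]
    history: list[str] = field(default_factory=list)


def compile_prompt(state: PromptState) -> str:
    """Assemble command, history, skill definitions, and safety rules."""
    sections = [
        "COMMAND: " + state.command,
        "HISTORY: " + (" | ".join(state.history) or "none"),
        "SKILLS: " + ", ".join(sorted(SKILL_LIBRARY)),
        "SAFETY: " + " ".join(state.safety_rules),
        "Reply as JSON with keys task_summary, reasoning, skill.",
    ]
    return "\n".join(sections)


def select_skill(vlm_reply: str) -> tuple[str, str]:
    """Return (skill, reasoning); fall back to hovering on any invalid reply."""
    try:
        reply = json.loads(vlm_reply)
    except json.JSONDecodeError:
        return SAFE_FALLBACK, "reply was not valid JSON"
    if not isinstance(reply, dict):
        return SAFE_FALLBACK, "reply was not a JSON object"
    skill = reply.get("skill")
    # Reject non-string values and names outside the library (hallucinated skills).
    if not isinstance(skill, str) or skill not in SKILL_LIBRARY:
        return SAFE_FALLBACK, f"unknown skill rejected: {skill!r}"
    return skill, str(reply.get("reasoning", ""))
```

In a reasoning-action loop, the returned reasoning string would be appended to `PromptState.history` before the next prompt is compiled, which is one simple way to carry context between steps.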

Paper Structure

This paper contains 21 sections, 6 equations, 7 figures, 7 tables, 1 algorithm.

Figures (7)

  • Figure 1: Overview of the AERMANI-VLM pipeline for vision-language-guided aerial manipulation. (1) Input: a user provides a natural language command. (2) Prompt compilation: the command is compiled into a structured prompt containing a preamble, reasoning history, skill definitions, and safety rules. (3) VLM inference: together with the current RGB observation, the prompt is processed by a pretrained VLM, which outputs an image description, task summary, explicit reasoning trace, and a discrete skill to execute. (4) Skill execution: the selected motion primitive or perception-driven routine is executed by deterministic low-level controllers, ensuring repeatability under flight dynamics. This reasoning–action loop continues until task completion, enabling the VLM to focus on semantic reasoning while delegating precise execution to robust controllers.
  • Figure 2: Coordinate frames and spatial grounding in AERMANI-VLM. (i) The global world frame $T_W$ anchors all transformations, defining poses for the aerial manipulator ($^{W}T_{AM}$), target object ($^{W}T_O$), and placement location ($^{W}T_{TP}$). (ii) Onboard frames for the camera ($^{AM}T_C$) and gripper ($^{AM}T_G$) are expressed relative to the manipulator body, maintaining consistency between perception and control.
  • Figure 3: Simulation Environment consisting of a custom indoor office setup with randomized layouts of tables, shelves, and other furniture.
  • Figure 4: Qualitative results from a real-world hardware experiment for the command: "Pick up the purple cup next to the coffee machine and place it on the wooden table." Each panel shows the first-person onboard view (large bottom image) and two static third-person views (small top images). The numbered sequence illustrates the complete, autonomous execution of the task. (1–3) The AM performs an active search to find the target object. (4–5) It executes a visually-guided approach and grasp. (6–8) After securing the cup, it searches for the destination table. (9–10) Finally, it approaches the table and places the object.
  • Figure 5: The open-vocabulary perception pipeline for object_localization (top row) and placement_localization (bottom row). Given a natural language query (e.g., "Purple Cup"), CLIPSeg performs zero-shot segmentation on the input RGB image. This 2D mask is then used to extract a filtered 3D point cloud from the depth data, allowing for the precise calculation of a grasp or placement pose.
  • ...and 2 more figures
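
The geometry behind the captions for Figures 2 and 5 can be sketched in a few lines of Python. The sketch assumes a pinhole camera model, takes the segmentation mask as given (CLIPSeg itself is not called), filters only on invalid depth, and reduces the pose calculation to a centroid. All of these are simplifying assumptions, and the function names and intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical. Masked pixels are back-projected into the camera frame, and the result is moved into the world frame with ${}^{W}T_{C} = {}^{W}T_{AM}\,{}^{AM}T_{C}$.

```python
def backproject(mask, depth, fx, fy, cx, cy):
    """Back-project masked pixels with valid depth into camera-frame 3D points."""
    points = []
    for v, (mask_row, depth_row) in enumerate(zip(mask, depth)):
        for u, (inside, z) in enumerate(zip(mask_row, depth_row)):
            if inside and z > 0.0:  # skip unmasked pixels and missing depth
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def centroid(points):
    """Mean of a non-empty list of 3D points."""
    if not points:
        raise ValueError("mask contains no pixels with valid depth")
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def compose(a, b):
    """Product of two 4x4 homogeneous transforms given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transform_point(t, p):
    """Apply a 4x4 homogeneous transform to a 3D point."""
    x, y, z = p
    return tuple(t[i][0] * x + t[i][1] * y + t[i][2] * z + t[i][3]
                 for i in range(3))


def target_in_world(mask, depth, intrinsics, t_w_am, t_am_c):
    """Centroid of the masked region, expressed in the world frame."""
    fx, fy, cx, cy = intrinsics
    p_cam = centroid(backproject(mask, depth, fx, fy, cx, cy))
    return transform_point(compose(t_w_am, t_am_c), p_cam)
```

A real grasp or placement pose would also need an orientation and outlier rejection on the point cloud; the centroid here stands in for that step only to keep the frame bookkeeping visible.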