AERMANI-VLM: Structured Prompting and Reasoning for Aerial Manipulation with Vision Language Models
Sarthak Mishra, Rishabh Dev Yadav, Avirup Das, Saksham Gupta, Wei Pan, Spandan Roy
TL;DR
This work tackles the safety and reliability gap in using vision–language models for aerial manipulation by decoupling high-level reasoning from low-level control. It introduces AERMANI-VLM, a structured prompting framework that uses a Descriptive Reasoning Trace (DRT) and a flight-safe skill library to convert natural-language commands into deterministic, auditable actions within a POMDP setting. The approach demonstrates robust zero-shot manipulation in simulation and on hardware, outperforming several baselines and ablations by constraining the model's reasoning and restricting execution to flight-safe skills. The results indicate that structured input and output are critical for grounding perception, maintaining temporal coherence, and preventing hallucinations in aerial tasks. The framework is a step toward safe, interpretable language-guided autonomy in aerial robots without task-specific retraining.
Abstract
The rapid progress of vision–language models (VLMs) has sparked growing interest in robotic control, where natural language expresses the task goals while visual feedback links perception to action. However, directly deploying VLM-driven policies on aerial manipulators remains unsafe and unreliable because the generated actions are often inconsistent, hallucination-prone, and dynamically infeasible for flight. In this work, we present AERMANI-VLM, the first framework to adapt pretrained VLMs for aerial manipulation by separating high-level reasoning from low-level control, without any task-specific fine-tuning. Our framework encodes natural-language instructions, task context, and safety constraints into a structured prompt that guides the model to generate a step-by-step reasoning trace in natural language. This reasoning output is used to select from a predefined library of discrete, flight-safe skills, ensuring interpretable and temporally consistent execution. By decoupling symbolic reasoning from physical action, AERMANI-VLM mitigates hallucinated commands and prevents unsafe behavior, enabling robust task completion. We validate the framework in both simulation and hardware on diverse multi-step pick-and-place tasks, demonstrating strong generalization to previously unseen commands, objects, and environments.
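To make the decoupling concrete, the following is a minimal sketch of how a structured prompt and a closed skill library could interact. It is an illustration under stated assumptions, not the authors' implementation: the skill names, the prompt field labels, the `ACTION:` output convention, and the hover fallback are all hypothetical, and the VLM call itself is omitted.

```python
import re
from enum import Enum


class Skill(Enum):
    # Hypothetical skill names; the actual library is not listed in this section.
    HOVER = "hover"
    MOVE_FORWARD = "move_forward"
    YAW_LEFT = "yaw_left"
    YAW_RIGHT = "yaw_right"
    GRASP = "grasp"
    RELEASE = "release"


def build_prompt(instruction, context, constraints):
    """Assemble instruction, task context, and safety constraints into one prompt."""
    skills = ", ".join(s.value for s in Skill)
    lines = [
        "INSTRUCTION: " + instruction,
        "CONTEXT: " + context,
        "SAFETY CONSTRAINTS:",
        *("- " + c for c in constraints),
        "ALLOWED SKILLS: " + skills,
        "Respond with step-by-step reasoning, then a final line 'ACTION: <skill>'.",
    ]
    return "\n".join(lines)


# Assumed output convention: the model ends its reasoning with "ACTION: <skill>".
_ACTION_RE = re.compile(r"^ACTION:\s*(\w+)\s*$", re.MULTILINE)


def select_skill(model_output, fallback=Skill.HOVER):
    """Map free-form model output to a library skill, or fall back if invalid."""
    matches = _ACTION_RE.findall(model_output)
    if not matches:
        return fallback  # no parseable action: take the assumed safe default
    name = matches[-1].lower()  # use the last ACTION line as the decision
    for skill in Skill:
        if skill.value == name:
            return skill
    return fallback  # a skill outside the library is never executed
```

In this sketch, the model's text never reaches the controller directly: only a member of `Skill` does, so an output such as `ACTION: barrel_roll` resolves to the fallback rather than to a new command.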
