Towards Accurate UAV Image Perception: Guiding Vision-Language Models with Stronger Task Prompts
Mingning Guo, Mengwei Wu, Shaoxian Li, Haifeng Li, Chao Tao
TL;DR
The paper tackles UAV image perception challenges caused by complex scenes and varying viewpoints by shifting from a model-centric to a prompt-centric strategy. It introduces AerialVP, a training-free agent that analyzes the task, selects specialized tools, and generates enhanced prompts to guide vision-language models (VLMs), along with AerialSense, a large UAV multi-task benchmark covering visual reasoning (VR), visual question answering (VQA), and visual grounding (VG). Across open-source and proprietary models, enhanced prompts yield substantial gains, especially in grounding and multi-step reasoning, with attention-heatmap evidence supporting improved cross-modal alignment. The work demonstrates a scalable, modular approach for robust UAV perception and provides a rich dataset for evaluating generalization under diverse conditions, while outlining avenues for expanding tool coverage and improving generalizability.
Abstract
Existing image perception methods based on vision-language models (VLMs) generally follow a paradigm in which models extract and analyze image content based on user-provided textual task prompts. However, such methods face limitations when applied to unmanned aerial vehicle (UAV) imagery, which presents challenges such as target confusion, scale variations, and complex backgrounds. These challenges arise because a VLM's understanding of image content depends on the semantic alignment between visual and textual tokens. When the task prompt is simplistic and the image content is complex, achieving effective alignment becomes difficult, limiting the model's ability to focus on task-relevant information. To address this issue, we introduce AerialVP, the first agent framework for task prompt enhancement in UAV image perception. AerialVP proactively extracts multi-dimensional auxiliary information from UAV images to enhance task prompts, overcoming the limitations of traditional VLM-based approaches. Specifically, the enhancement process includes three stages: (1) analyzing the task prompt to identify the task type and enhancement needs, (2) selecting appropriate tools from the tool repository, and (3) generating enhanced task prompts based on the analysis and selected tools. To evaluate AerialVP, we introduce AerialSense, a comprehensive benchmark for UAV image perception that includes Aerial Visual Reasoning, Aerial Visual Question Answering, and Aerial Visual Grounding tasks. AerialSense provides a standardized basis for evaluating model generalization and performance across diverse resolutions, lighting conditions, and both urban and natural scenes. Experimental results demonstrate that AerialVP significantly enhances task prompt guidance, leading to stable and substantial performance improvements in both open-source and proprietary VLMs. Our work will be available at https://github.com/lostwolves/AerialVP.
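The three-stage enhancement process described in the abstract can be illustrated with a minimal Python sketch. It is not the AerialVP implementation: every name in it (`analyze_task`, `select_tools`, `enhance_prompt`, `TOOL_REPOSITORY`, the tool functions, and the output template) is hypothetical. The keyword heuristic is only a stand-in for whatever task analysis the agent actually performs, and the tools are stubs that return fixed strings instead of inspecting an image.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical keyword lists, used here as a stand-in for real task analysis.
TASK_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "grounding": ("locate", "where is", "bounding box", "find the"),
    "vqa": ("how many", "what color", "is there", "what is"),
}


def analyze_task(prompt: str) -> str:
    """Stage 1: identify the task type from the task prompt."""
    lowered = prompt.lower()
    for task, keywords in TASK_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return task
    return "reasoning"  # fallback when no keyword matches


# Stub tools (assumptions): real tools would run models on the image.
def detect_objects(image_path: str) -> str:
    return "detected objects: 3 cars, 1 truck"


def describe_scene(image_path: str) -> str:
    return "scene: urban intersection, daytime"


TOOL_REPOSITORY: Dict[str, Callable[[str], str]] = {
    "object_detector": detect_objects,
    "scene_describer": describe_scene,
}

# Hypothetical mapping from task type to the tools that may help it.
TOOLS_FOR_TASK: Dict[str, List[str]] = {
    "grounding": ["object_detector"],
    "vqa": ["object_detector", "scene_describer"],
    "reasoning": ["scene_describer"],
}


def select_tools(task: str) -> List[Callable[[str], str]]:
    """Stage 2: pick tools from the repository for the given task type."""
    return [TOOL_REPOSITORY[name] for name in TOOLS_FOR_TASK[task]]


def enhance_prompt(prompt: str, image_path: str) -> str:
    """Stage 3: build an enhanced prompt from the analysis and tool outputs."""
    task = analyze_task(prompt)
    hints = [tool(image_path) for tool in select_tools(task)]
    context = "\n".join(f"- {hint}" for hint in hints)
    return (
        f"Task type: {task}\n"
        f"Auxiliary information:\n{context}\n"
        f"Question: {prompt}"
    )


if __name__ == "__main__":
    print(enhance_prompt("How many cars are in the parking lot?", "uav_frame.jpg"))
```

The enhanced string would then be passed to a VLM in place of the original prompt. Keeping the tools behind a plain dictionary reflects the modular, training-free design the abstract describes: adding a tool means registering one more callable, with no change to the model.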
