Towards Accurate UAV Image Perception: Guiding Vision-Language Models with Stronger Task Prompts

Mingning Guo, Mengwei Wu, Shaoxian Li, Haifeng Li, Chao Tao

TL;DR

The paper tackles UAV image perception challenges caused by complex scenes and varying viewpoints by shifting from model-centric to prompt-centric strategies. It introduces AerialVP, a training-free agent that analyzes tasks, selects specialized tools, and generates enhanced prompts to guide vision–language models, along with AerialSense, a large UAV multi-task benchmark (VR, VQA, VG). Across open-source and proprietary models, enhanced prompts yield substantial gains, especially in grounding and multi-step reasoning, with attention-heatmap evidence supporting improved cross-modal alignment. The work demonstrates a scalable, modular approach for robust UAV perception and provides a rich dataset for evaluating generalization under diverse conditions, while outlining avenues for expanding tool coverage and improving generalizability.

Abstract

Existing image perception methods based on VLMs generally follow a paradigm wherein models extract and analyze image content based on user-provided textual task prompts. However, such methods face limitations when applied to UAV imagery, which presents challenges like target confusion, scale variations, and complex backgrounds. These challenges arise because VLMs' understanding of image content depends on the semantic alignment between visual and textual tokens. When the task prompt is simplistic and the image content is complex, achieving effective alignment becomes difficult, limiting the model's ability to focus on task-relevant information. To address this issue, we introduce AerialVP, the first agent framework for task prompt enhancement in UAV image perception. AerialVP proactively extracts multi-dimensional auxiliary information from UAV images to enhance task prompts, overcoming the limitations of traditional VLM-based approaches. Specifically, the enhancement process includes three stages: (1) analyzing the task prompt to identify the task type and enhancement needs, (2) selecting appropriate tools from the tool repository, and (3) generating enhanced task prompts based on the analysis and selected tools. To evaluate AerialVP, we introduce AerialSense, a comprehensive benchmark for UAV image perception that includes Aerial Visual Reasoning, Aerial Visual Question Answering, and Aerial Visual Grounding tasks. AerialSense provides a standardized basis for evaluating model generalization and performance across diverse resolutions, lighting conditions, and both urban and natural scenes. Experimental results demonstrate that AerialVP significantly enhances task prompt guidance, leading to stable and substantial performance improvements in both open-source and proprietary VLMs. Our work will be available at https://github.com/lostwolves/AerialVP.
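
The three stages named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the keyword rules, and the stub tools below are all hypothetical, chosen only to show how analysis, tool selection, and prompt generation could fit together. The real framework would presumably use a VLM or dedicated perception tools at each stage.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Analysis:
    task_type: str      # "VR", "VQA", or "VG"
    needs: List[str]    # kinds of auxiliary information to request


# Hypothetical tool repository. Real tools would inspect the image;
# these stubs ignore it and return fixed text so the sketch stays runnable.
TOOLS: Dict[str, Callable[[str], str]] = {
    "semantic": lambda image: "Scene: parking lot with several vehicles.",
    "spatial": lambda image: "The red car is near the top-left corner.",
}


def analyze_prompt(prompt: str) -> Analysis:
    """Stage 1: guess task type and enhancement needs (assumed keyword rules)."""
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    if words & {"locate", "where", "find"}:
        return Analysis("VG", ["semantic", "spatial"])
    if prompt.rstrip().endswith("?"):
        return Analysis("VQA", ["semantic"])
    return Analysis("VR", ["semantic", "spatial"])


def select_tools(analysis: Analysis) -> List[Callable[[str], str]]:
    """Stage 2: pick the repository tools that cover the identified needs."""
    return [TOOLS[name] for name in analysis.needs if name in TOOLS]


def enhance_prompt(prompt: str, image: str) -> str:
    """Stage 3: run the selected tools and append their output to the prompt."""
    analysis = analyze_prompt(prompt)
    hints = [tool(image) for tool in select_tools(analysis)]
    return f"[Task: {analysis.task_type}] {prompt}\nContext: " + " ".join(hints)


if __name__ == "__main__":
    print(enhance_prompt("Locate the red car.", "uav.jpg"))
```

The enhanced string would then be passed to an unmodified VLM in place of the original prompt, which is what makes the approach training-free.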

Paper Structure

This paper contains 27 sections, 4 equations, 14 figures, and 10 tables.

Figures (14)

  • Figure 1: Impact of task prompt design on VLM perception accuracy under different environments. (a) In simple natural scenes, a basic prompt effectively guides the VLM to focus on the correct target and generate accurate results. (b) In complex UAV scenes, a simple prompt lacking auxiliary information fails to direct attention, leading to incorrect results. (c) An enhanced prompt guides the VLM to focus on the correct target and achieve accurate perception.
  • Figure 2: Overall architecture of the AerialVP agent framework.
  • Figure 3: Structure of the Tool Repository in the AerialVP framework. The repository consists of two main categories of tools: Prompt Analysis Tools for task understanding and subtask planning, and Prompt Enhancement Tools for generating semantic and spatial descriptions that enhance task prompts.
  • Figure 4: Comparison between the Model-Centric UAV Perception Framework and the Prompt-Centric UAV Perception Framework. Here, $\mathrm{VLM}_{\text{training-required}}$ denotes a VLM trained on both large-scale general datasets and task-specific datasets, while $\mathrm{VLM}_{\text{training-free}}$ denotes a VLM that has not been trained on any task-specific datasets.
  • Figure 5: Workflow of the AerialVP agent for task prompt enhancement, consisting of three stages: Task Prompt Analysis, Tool Selection, and Enhanced Prompt Generation. The process is illustrated with the example of a vehicle localization task.
  • ...and 9 more figures