UAV-VLA: Vision-Language-Action System for Large Scale Aerial Mission Generation
Oleg Sautenkov, Yasheerah Yaqoot, Artem Lykov, Muhammad Ahsan Mustafa, Grik Tadevosyan, Aibek Akhmetkazy, Miguel Altamirano Cabrera, Mikhail Martynov, Sausar Karaf, Dzmitry Tsetserukou
TL;DR
The paper tackles the challenge of making UAV mission planning accessible and scalable by introducing UAV-VLA, a Vision-Language-Action system that converts natural-language requests into path-and-action plans using satellite imagery processed by a Visual Language Model and reasoning from GPT. It presents the UAV-VLPA-nano-30 global benchmark and demonstrates a three-module pipeline (goal extraction, object search, and action generation) that operates in a zero-shot setting without task-specific training. Results show that the system plans about 6.5x faster than human planning, produces trajectories that are on average 21.6% longer than ground truth, and yields an RMSE that varies with the alignment method, indicating effective, scalable autonomous planning with room for improvement in precision. This work advances language-based UAV mission generation and provides a foundation for fully autonomous, multimodal aerial operations across diverse environments.
Abstract
The UAV-VLA (Visual-Language-Action) system is a tool designed to facilitate communication with aerial robots. By integrating satellite imagery processing with a Visual Language Model (VLM) and the capabilities of GPT, UAV-VLA enables users to generate general flight path-and-action plans through simple text requests. The system leverages the rich contextual information provided by satellite images to support decision-making and mission planning. Combining visual analysis by the VLM with natural language processing by GPT provides the user with a path-and-action set, making aerial operations more efficient and accessible. In evaluation, the generated trajectories differed in length from the reference trajectories by 22%, and the mean error in locating the objects of interest on the map was 34.22 m, measured as Euclidean distance under the K-Nearest Neighbors (KNN) approach.
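The two figures quoted above (a relative trajectory length difference and a mean Euclidean error under nearest-neighbor matching) can be illustrated with a minimal sketch. This is not the authors' evaluation code. It assumes that waypoints are already projected to planar coordinates in meters, that each ground-truth point is matched to its single nearest predicted point (k = 1), and that all function names and sample coordinates are hypothetical.

```python
import math


def path_length(points):
    """Sum of straight-line segment lengths along an ordered list of (x, y) points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def length_difference_percent(generated, reference):
    """Relative length difference of a generated path versus a reference path, in percent."""
    ref_len = path_length(reference)
    return 100.0 * (path_length(generated) - ref_len) / ref_len


def nearest_neighbor_errors(predicted, truth):
    """For each ground-truth point, the distance to its closest predicted point (k = 1)."""
    return [min(math.dist(t, p) for p in predicted) for t in truth]


def mean_error(errors):
    """Arithmetic mean of the per-point errors."""
    return sum(errors) / len(errors)


def rmse(errors):
    """Root mean square of the per-point errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Hypothetical sample data, in meters (assumption: planar coordinates).
reference = [(0.0, 0.0), (100.0, 0.0), (100.0, 100.0)]
generated = [(0.0, 0.0), (120.0, 0.0), (120.0, 100.0)]

diff_pct = length_difference_percent(generated, reference)
errors = nearest_neighbor_errors(generated, reference)
print(f"length difference: {diff_pct:.1f}%")
print(f"mean error: {mean_error(errors):.2f} m, RMSE: {rmse(errors):.2f} m")
```

Matching from ground truth to predictions, rather than the reverse, means every object of interest contributes one error term even when the planner produces extra points. Whether the paper matches in this direction is not stated in the abstract.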
