UAV-VLA: Vision-Language-Action System for Large Scale Aerial Mission Generation

Oleg Sautenkov; Yasheerah Yaqoot; Artem Lykov; Muhammad Ahsan Mustafa; Grik Tadevosyan; Aibek Akhmetkazy; Miguel Altamirano Cabrera; Mikhail Martynov; Sausar Karaf; Dzmitry Tsetserukou

TL;DR

The paper aims to make UAV mission planning accessible and scalable. It introduces UAV-VLA, a Vision-Language-Action system that converts natural-language requests into path-and-action plans by combining satellite imagery processed by a Visual Language Model (VLM) with reasoning from GPT. It also presents the UAV-VLPA-nano-30 global benchmark and demonstrates a three-module pipeline (goal extraction, object search, and action generation) that operates zero-shot, without task-specific training. The system plans about 6.5x faster than a human, its trajectories are on average 21.6% longer than ground truth, and its RMSE varies across alignment methods. These results indicate that the approach is effective and scalable, with room to improve precision. The work advances language-based UAV mission generation and provides a foundation for fully autonomous, multimodal aerial operations across diverse environments.

Abstract

The UAV-VLA (Visual-Language-Action) system is a tool designed to facilitate communication with aerial robots. By integrating satellite imagery processing with a Visual Language Model (VLM) and the capabilities of GPT, UAV-VLA enables users to generate general flight path-and-action plans through simple text requests. The system leverages the rich contextual information provided by satellite images, allowing for enhanced decision-making and mission planning. The combination of visual analysis by the VLM and natural language processing by GPT provides the user with a path-and-action set, making aerial operations more efficient and accessible. The newly developed method showed a 22% difference in the length of the generated trajectory and a mean error of 34.22 m, measured by Euclidean distance, in locating the objects of interest on a map with the K-Nearest Neighbors (KNN) approach.
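The two figures quoted in the abstract (relative trajectory-length difference and mean object-localization error) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the use of latitude/longitude waypoints with great-circle distances, and the 1-nearest-neighbor matching from predicted points to ground-truth points are all assumptions made here for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def path_length_m(waypoints):
    """Total length of a polyline through consecutive waypoints."""
    return sum(haversine_m(a, b) for a, b in zip(waypoints, waypoints[1:]))


def length_difference_pct(generated, reference):
    """Signed length difference of the generated path relative to the reference, in percent."""
    ref = path_length_m(reference)
    return 100.0 * (path_length_m(generated) - ref) / ref


def mean_nearest_neighbor_error_m(predicted, ground_truth):
    """Mean distance from each predicted point to its closest ground-truth point.

    Assumption: 1-nearest-neighbor matching, predicted -> ground truth.
    """
    return sum(min(haversine_m(p, g) for g in ground_truth)
               for p in predicted) / len(predicted)
```

Under these assumptions, a positive value from `length_difference_pct` means the generated trajectory is longer than the reference, which is the direction of the difference reported in the TL;DR.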

Paper Structure (12 sections, 6 equations, 6 figures, 1 table)

Introduction
Related Work
Data and Benchmark
Satellite Images and Metadata Description
Manual Flight Plan Generation
Methodology
Experiments
Evaluation Metrics
System Setup and Procedure
Experimental Results
Conclusion
Future Work

Figures (6)

Figure 1: The pipeline of the UAV-VLA system.
Figure 2: Examples of the satellite imagery in the benchmark data.
Figure 3: Mission Planner environment showing the violet square boundary, home position and buildings with the text description.
Figure 4: Comparison of flight plans generated by a human expert (a) and the UAV-VLA system (b).
Figure 5: The comparison of the trajectory lengths made by UAV-VLA and by an experienced operator.
...and 1 more figure
