CognitiveDrone: A VLA Model and Evaluation Benchmark for Real-Time Cognitive Task Solving and Reasoning in UAVs

Artem Lykov, Valerii Serpiva, Muhammad Haris Khan, Oleg Sautenkov, Artyom Myshlyaev, Grik Tadevosyan, Yasheerah Yaqoot, Dzmitry Tsetserukou

TL;DR

CognitiveDrone addresses the absence of open benchmarks for cognitive UAVs by introducing a 7B-parameter Vision-Language-Action (VLA) model trained on over 8,000 simulated trajectories, paired with CognitiveDroneBench, a Gazebo-based benchmark that embeds cognitive tasks into a drone-racing track. The authors augment the base VLA with a slower VLM reasoning module (CognitiveDrone-R1) that disambiguates instructions before control, raising the overall success rate (reported on a 0-100% scale) from 59.6% to 77.2% across Human Recognition, Symbol Understanding, and Reasoning. RaceVLA handles low-level flight well but struggles with cognition (31.3% overall), the base CognitiveDrone model substantially improves cognitive performance, and CognitiveDrone-R1 delivers the best overall result, demonstrating the value of explicit reasoning in real-time UAV control. The work releases open-source datasets, a benchmark, model weights, and training/inference code, giving the community a common basis for evaluating cognitive capabilities in UAVs.

Abstract

This paper introduces CognitiveDrone, a novel Vision-Language-Action (VLA) model tailored for complex Unmanned Aerial Vehicle (UAV) tasks that demand advanced cognitive abilities. Trained on a dataset of over 8,000 simulated flight trajectories across three key categories (Human Recognition, Symbol Understanding, and Reasoning), the model generates real-time 4D action commands from first-person visual inputs and textual instructions. To further enhance performance in intricate scenarios, we propose CognitiveDrone-R1, which integrates an additional Vision-Language Model (VLM) reasoning module to simplify task directives prior to high-frequency control. Experimental evaluations using our open-source benchmark, CognitiveDroneBench, reveal that while a racing-oriented model (RaceVLA) achieves an overall success rate of 31.3%, the base CognitiveDrone model reaches 59.6%, and CognitiveDrone-R1 attains a success rate of 77.2%. These results demonstrate improvements of up to 30% in critical cognitive tasks, underscoring the effectiveness of incorporating advanced reasoning capabilities into UAV control systems. Our contributions include the development of a state-of-the-art VLA model for UAV control and the introduction of the first dedicated benchmark for assessing cognitive tasks in drone operations. The complete repository is available at cognitivedrone.github.io.
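
The division of labor described in the abstract, with a slower VLM simplifying the instruction and a faster VLA emitting 4D actions, can be sketched as a two-rate loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the `TwoRateController` class, the `reason_every` ratio, the callable signatures, the stub models, and the (vx, vy, vz, yaw rate) action layout are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple

# Assumed 4D action layout (hypothetical): (vx, vy, vz, yaw_rate).
Action = Tuple[float, float, float, float]


@dataclass
class TwoRateController:
    """Runs a slow reasoner occasionally and a fast policy on every frame."""

    reasoner: Callable[[str, Any], str]   # slow VLM stand-in: (instruction, frame) -> directive
    policy: Callable[[str, Any], Action]  # fast VLA stand-in: (directive, frame) -> action
    reason_every: int = 10                # hypothetical ratio of policy steps per reasoning step

    def run(self, instruction: str, frames: Sequence[Any]) -> List[Action]:
        directive = instruction
        actions: List[Action] = []
        for step, frame in enumerate(frames):
            # Refresh the simplified directive only on every N-th frame.
            if step % self.reason_every == 0:
                directive = self.reasoner(instruction, frame)
            # The policy acts on every frame using the latest directive.
            actions.append(self.policy(directive, frame))
        return actions


# Stub models for demonstration only; real models would consume camera images.
reasoner_calls: List[int] = []


def stub_reasoner(instruction: str, frame: Any) -> str:
    reasoner_calls.append(frame)
    return "fly through the left gate"


def stub_policy(directive: str, frame: Any) -> Action:
    return (1.0, 0.0, 0.0, 0.0)


controller = TwoRateController(stub_reasoner, stub_policy, reason_every=3)
demo_actions = controller.run("fly to the gate showing the answer to 2 + 2", list(range(7)))
```

With seven frames and `reason_every=3`, the stub reasoner runs on frames 0, 3, and 6 while the policy runs on all seven. The point of the split is that the expensive reasoning step does not have to keep pace with the control rate.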

Paper Structure

This paper contains 12 sections, 5 figures.

Figures (5)

  • Figure 1: CognitiveDrone is a VLA system for UAVs that generates smooth 4D control commands from first-person visual inputs and natural language instructions. It combines a 7B-parameter VLA model trained on an extensive open-source dataset of cognitive tasks—including reasoning, human recognition, and symbol understanding—with a 7B-parameter VLM reasoning module that refines task directives. The system is evaluated within CognitiveDroneBench—the first evaluation benchmark for VLA systems tailored to cognitive UAVs—where the drone must navigate a track with gates by selecting the appropriate gate through solving cognitive tasks. We have released the complete dataset, benchmark environment, model weights, and training/inference code as open source.
  • Figure 2: CognitiveDrone system architecture.
  • Figure 3: Examples of prepared dataset tasks for VLA to solve cognitive tasks adapted for UAVs.
  • Figure 4: Metrics Overview: (a) L1 loss indicates absolute prediction errors. (b) Action accuracy quantifies the percentage of correct predictions. (c) Cross-entropy loss measures performance on discretized action tokens.
  • Figure 5: Benchmark performance on CognitiveDroneBench for the RaceVLA, CognitiveDrone, and CognitiveDrone-R1 models. Shown are scores for Reasoning, Human Recognition, and Symbol Understanding tasks, as well as the overall average.
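
The three training metrics named in the Figure 4 caption can be written out explicitly. The sketch below is pure Python with hypothetical helper names. The uniform binning in `discretize` and its bin count are assumptions for illustration, since this summary does not state how continuous actions are converted to tokens.

```python
import math
from typing import List, Sequence


def l1_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between predicted and target continuous actions."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def token_accuracy(pred_tokens: Sequence[int], target_tokens: Sequence[int]) -> float:
    """Fraction of action tokens predicted exactly (multiply by 100 for a percentage)."""
    hits = sum(1 for p, t in zip(pred_tokens, target_tokens) if p == t)
    return hits / len(target_tokens)


def cross_entropy(logits: Sequence[Sequence[float]], target_tokens: Sequence[int]) -> float:
    """Mean negative log-likelihood of the target tokens under softmax(logits)."""
    total = 0.0
    for row, t in zip(logits, target_tokens):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[t]
    return total / len(target_tokens)


def discretize(value: float, low: float, high: float, bins: int) -> int:
    """Map a continuous action to a bin index (assumed uniform binning)."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * bins)
    return min(index, bins - 1)  # keep value == high inside the last bin


# Small worked example with made-up numbers.
example_l1 = l1_loss([0.0, 1.0], [0.5, 0.5])
example_acc = token_accuracy([1, 2, 3, 4], [1, 2, 0, 4])
example_ce = cross_entropy([[0.0, 0.0]], [0])
example_bin = discretize(0.0, -1.0, 1.0, 4)
```

L1 loss is computed on continuous action values, whereas accuracy and cross-entropy are computed on the discretized tokens, which is why the caption reports them as separate panels.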