RaceVLA: VLA-based Racing Drone Navigation with Human-like Behaviour

Valerii Serpiva, Artem Lykov, Artyom Myshlyaev, Muhammad Haris Khan, Ali Alridha Abdulkarim, Oleg Sautenkov, Dzmitry Tsetserukou

TL;DR

RaceVLA introduces a VLA-based framework for autonomous drone racing that maps FPV visuals and natural language instructions directly to a 4D action signal $(V_x,V_y,V_z,\omega)$, enabling human-like, real-time decision making. Built on fine-tuning OpenVLA with a drone-specific dataset via LoRA adapters, RaceVLA runs at 4 Hz on a GPU-accelerated pipeline and operates within a ROS-enabled onboard/outboard architecture that includes OpenVINS localization. Experiments show RaceVLA outperforms RT-2 across visual, motion, physical, and semantic generalization and achieves strong motion and semantic generalization relative to OpenVLA, albeit with some drawbacks in visual and physical generalization due to dynamic drone vision. The results suggest RaceVLA offers robust, high-speed navigation capabilities for competitive drone racing, with ongoing work aimed at reducing inference time and expanding data diversity to further improve generalization in diverse environments.
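The mapping from an FPV frame and an instruction to a 4D velocity command at a fixed rate can be pictured as a simple control loop. The following is a minimal sketch under stated assumptions, not the authors' ROS pipeline: `VelocityAction`, `limit_action`, `run_control_loop`, `stub_policy`, and the limits `v_max` and `omega_max` are hypothetical names and values introduced here for illustration; only the 4 Hz rate and the $(V_x,V_y,V_z,\omega)$ action layout come from the text.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass(frozen=True)
class VelocityAction:
    """4D action (V_x, V_y, V_z, omega): linear velocities and yaw rate."""
    vx: float
    vy: float
    vz: float
    omega: float


def clamp(value: float, limit: float) -> float:
    """Restrict value to the symmetric range [-limit, limit]."""
    return max(-limit, min(limit, value))


def limit_action(a: VelocityAction, v_max: float = 2.0,
                 omega_max: float = 1.0) -> VelocityAction:
    # v_max and omega_max are illustrative safety limits (assumptions).
    return VelocityAction(clamp(a.vx, v_max), clamp(a.vy, v_max),
                          clamp(a.vz, v_max), clamp(a.omega, omega_max))


def run_control_loop(policy: Callable[[Any, str], VelocityAction],
                     get_frame: Callable[[], Any],
                     send_command: Callable[[VelocityAction], None],
                     instruction: str,
                     rate_hz: float = 4.0,
                     steps: int = 4,
                     sleep: Callable[[float], None] = time.sleep,
                     clock: Callable[[], float] = time.monotonic) -> None:
    """Query the policy once per period and forward the limited action."""
    period = 1.0 / rate_hz  # 0.25 s at the reported 4 Hz
    for _ in range(steps):
        started = clock()
        action = policy(get_frame(), instruction)
        send_command(limit_action(action))
        # Sleep for whatever is left of the period after inference.
        remaining = period - (clock() - started)
        if remaining > 0:
            sleep(remaining)


def stub_policy(frame: Any, instruction: str) -> VelocityAction:
    # Hypothetical stand-in for the fine-tuned VLA model.
    return VelocityAction(vx=3.0, vy=0.0, vz=-0.2, omega=0.1)


sent: List[VelocityAction] = []
run_control_loop(stub_policy, get_frame=lambda: None,
                 send_command=sent.append,
                 instruction="Fly through the arch gate",
                 steps=2, sleep=lambda s: None)
print(sent)
```

In this sketch the loop sleeps only for the time left in each 0.25 s period, so a slower policy call shortens the sleep rather than stretching the cycle; if inference alone takes longer than the period, the loop simply runs below the target rate.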

Abstract

RaceVLA presents an innovative approach for autonomous racing drone navigation by leveraging Visual-Language-Action (VLA) to emulate human-like behavior. This research explores the integration of advanced algorithms that enable drones to adapt their navigation strategies based on real-time environmental feedback, mimicking the decision-making processes of human pilots. The model, fine-tuned on a collected racing drone dataset, demonstrates strong generalization despite the complexity of drone racing environments. RaceVLA outperforms OpenVLA in motion (75.0 vs 60.0) and semantic generalization (45.5 vs 36.3), benefiting from the dynamic camera and simplified motion tasks. However, visual (79.6 vs 87.0) and physical (50.0 vs 76.7) generalization were lower than OpenVLA's due to the challenges of maneuvering in dynamic environments with varying object sizes. RaceVLA also outperforms RT-2 across all axes: visual (79.6 vs 52.0), motion (75.0 vs 55.0), physical (50.0 vs 26.7), and semantic (45.5 vs 38.8), demonstrating its robustness for real-time adjustments in complex environments. Experiments revealed an average velocity of 1.04 m/s, with a maximum speed of 2.02 m/s, and consistent maneuverability, demonstrating RaceVLA's ability to handle high-speed scenarios effectively. These findings highlight the potential of RaceVLA for high-performance navigation in competitive racing contexts. The RaceVLA codebase, pretrained weights, and dataset are available at https://racevla.github.io/
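The per-axis comparison in the abstract is easier to read as margins. The short sketch below only tabulates the scores quoted above and subtracts each baseline from RaceVLA; the names `SCORES` and `margins` are introduced here for illustration and are not part of the RaceVLA codebase.

```python
# Generalization scores as quoted in the abstract.
SCORES = {
    "visual":   {"RaceVLA": 79.6, "OpenVLA": 87.0, "RT-2": 52.0},
    "motion":   {"RaceVLA": 75.0, "OpenVLA": 60.0, "RT-2": 55.0},
    "physical": {"RaceVLA": 50.0, "OpenVLA": 76.7, "RT-2": 26.7},
    "semantic": {"RaceVLA": 45.5, "OpenVLA": 36.3, "RT-2": 38.8},
}


def margins(baseline: str) -> dict:
    """RaceVLA score minus the baseline score on each axis, in points."""
    return {axis: round(row["RaceVLA"] - row[baseline], 1)
            for axis, row in SCORES.items()}


print("vs OpenVLA:", margins("OpenVLA"))
print("vs RT-2:   ", margins("RT-2"))
```

Against OpenVLA the margins are +15.0 (motion) and +9.2 (semantic) but -7.4 (visual) and -26.7 (physical), so the physical gap is the largest in either direction. Against RT-2 every margin is positive, ranging from +6.7 (semantic) to +27.6 (visual).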

Paper Structure

This paper contains 10 sections, 6 figures.

Figures (6)

  • Figure 1: RaceVLA is the first VLA model specifically designed for racing drones. It processes First-Person View (FPV) video streams alongside natural language commands to generate velocity actions $(V_x, V_y, V_z)$ and a yaw angular speed ($\omega$) control signal. This system enables drones to autonomously execute a wide range of flight tasks, including navigation in novel scenarios and unfamiliar environments. By leveraging a purpose-built training dataset, RaceVLA exhibits robust generalization capabilities.
  • Figure 2: RaceVLA system architecture.
  • Figure 3: a) Dataset for the task "Fly through the arch gate" featuring trajectories and actions for navigating arch-shaped gates. b) Dataset for the task "Fly through the square gate" highlighting data specific to square-shaped gates. c) "Fly through multiple gates on circular track". d) "Fly through gates on circular track".
  • Figure 4: a) Plots of recorded trajectories in the circular track task. The recorded trajectory (gray), actions generated by the VLA model (red arrows), and drone racing gates (gray lines) are shown. b) The right plot visualizes the velocity action vector and yaw rotation of the drone for 3 laps.
  • Figure 5: Evaluation of the RaceVLA system for autonomous drone navigation through racing gates, starting from different initial positions (seen, unseen positions). (a) Performance of the model starting the drone from different initial positions with the task "Fly through one gate." (b) Model evaluation for tasks "Fly through the Right gate" and "Fly through the Left gate." (c) Evaluation for sequential tasks "Fly through the Arch gate," "Fly through the Square gate," "Fly through the Right gate," and "Fly through the Left gate." (d) Evaluation in a scenario where the Arch gate is positioned on the right side of the flight zone, and the Square gate is on the left side of the flight zone.
  • ...and 1 more figure