RaceVLA: VLA-based Racing Drone Navigation with Human-like Behaviour
Valerii Serpiva, Artem Lykov, Artyom Myshlyaev, Muhammad Haris Khan, Ali Alridha Abdulkarim, Oleg Sautenkov, Dzmitry Tsetserukou
TL;DR
RaceVLA introduces a VLA-based framework for autonomous drone racing that maps FPV visuals and natural language instructions directly to a 4D action signal $(V_x,V_y,V_z,\omega)$, enabling human-like, real-time decision making. Built by fine-tuning OpenVLA on a drone-specific dataset with LoRA adapters, RaceVLA runs at 4 Hz on a GPU-accelerated pipeline and operates within a ROS-enabled onboard/outboard architecture that includes OpenVINS localization. Experiments show that RaceVLA outperforms RT-2 in visual, motion, physical, and semantic generalization, and outperforms OpenVLA in motion and semantic generalization, while trailing OpenVLA in visual and physical generalization because of the drone's dynamic camera view. The results suggest RaceVLA offers robust, high-speed navigation for competitive drone racing; ongoing work aims to reduce inference time and expand data diversity to improve generalization across environments.
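To make the 4D action interface concrete, the following is a minimal sketch, not the RaceVLA implementation. It assumes a policy is any callable that takes a frame and an instruction and returns $(V_x,V_y,V_z,\omega)$. The names `Action4D`, `limit_action`, `run_policy`, and `stub_policy`, as well as the limits `v_max` and `omega_max`, are hypothetical and chosen only for illustration. The 4 Hz rate is the only value taken from the text.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List

CONTROL_RATE_HZ = 4.0                      # inference rate stated in the text
CONTROL_PERIOD_S = 1.0 / CONTROL_RATE_HZ   # time budget per command: 0.25 s


@dataclass(frozen=True)
class Action4D:
    """Hypothetical container for the 4D action (V_x, V_y, V_z, omega)."""
    vx: float
    vy: float
    vz: float
    omega: float


def clamp(value: float, limit: float) -> float:
    """Restrict value to the symmetric interval [-limit, limit]."""
    return max(-limit, min(limit, value))


def limit_action(a: Action4D, v_max: float, omega_max: float) -> Action4D:
    """Clamp each component independently (per-axis limit, not a norm limit)."""
    return Action4D(
        clamp(a.vx, v_max),
        clamp(a.vy, v_max),
        clamp(a.vz, v_max),
        clamp(a.omega, omega_max),
    )


def run_policy(
    policy: Callable[[Any, str], Action4D],
    frames: Iterable[Any],
    instruction: str,
    v_max: float = 2.0,      # assumed limit, not from the paper
    omega_max: float = 1.0,  # assumed limit, not from the paper
) -> List[Action4D]:
    """Query the policy once per frame and return the limited commands."""
    return [limit_action(policy(f, instruction), v_max, omega_max) for f in frames]


def stub_policy(frame: Any, instruction: str) -> Action4D:
    """Stand-in for a VLA model; always returns the same raw command."""
    return Action4D(3.0, 0.0, -0.5, 2.0)


commands = run_policy(stub_policy, [None, None], "fly through the gate")
```

In a real system the list of frames would be replaced by a camera stream, and each command would have to be produced within `CONTROL_PERIOD_S` to sustain the stated 4 Hz rate.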
Abstract
RaceVLA presents an approach to autonomous racing drone navigation that uses a Vision-Language-Action (VLA) model to emulate human-like behavior. This research explores the integration of advanced algorithms that enable drones to adapt their navigation strategies based on real-time environmental feedback, mimicking the decision-making processes of human pilots. The model, fine-tuned on a collected racing drone dataset, demonstrates strong generalization despite the complexity of drone racing environments. RaceVLA outperforms OpenVLA in motion (75.0 vs 60.0) and semantic generalization (45.5 vs 36.3), benefiting from the dynamic camera and simplified motion tasks. However, visual (79.6 vs 87.0) and physical (50.0 vs 76.7) generalization were lower than OpenVLA's, owing to the challenges of maneuvering in dynamic environments with varying object sizes. RaceVLA also outperforms RT-2 on all four axes: visual (79.6 vs 52.0), motion (75.0 vs 55.0), physical (50.0 vs 26.7), and semantic (45.5 vs 38.8), demonstrating its robustness for real-time adjustments in complex environments. In experiments, the drone flew at an average velocity of 1.04 m/s and a maximum speed of 2.02 m/s with consistent maneuverability, demonstrating RaceVLA's ability to handle high-speed scenarios effectively. These findings highlight the potential of RaceVLA for high-performance navigation in competitive racing contexts. The RaceVLA codebase, pretrained weights, and dataset are available at https://racevla.github.io/
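The comparisons above can be checked with a few lines of arithmetic. The sketch below copies the generalization scores from the abstract and computes RaceVLA's margin over each baseline per axis. The names `SCORES` and `margins` are illustrative only, and the scores are used exactly as reported.

```python
# Generalization scores as reported in the abstract.
SCORES = {
    "visual":   {"RaceVLA": 79.6, "OpenVLA": 87.0, "RT-2": 52.0},
    "motion":   {"RaceVLA": 75.0, "OpenVLA": 60.0, "RT-2": 55.0},
    "physical": {"RaceVLA": 50.0, "OpenVLA": 76.7, "RT-2": 26.7},
    "semantic": {"RaceVLA": 45.5, "OpenVLA": 36.3, "RT-2": 38.8},
}


def margins(baseline: str) -> dict:
    """Return RaceVLA's score minus the baseline's score for each axis."""
    return {
        axis: round(row["RaceVLA"] - row[baseline], 1)
        for axis, row in SCORES.items()
    }


for baseline in ("OpenVLA", "RT-2"):
    print(baseline, margins(baseline))
```

The margins against RT-2 are positive on every axis (the smallest is 6.7 on semantic). Against OpenVLA they are positive on motion and semantic, but negative on visual (7.4 lower) and physical (26.7 lower).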
