Learning Deep Sensorimotor Policies for Vision-based Autonomous Drone Racing
Jiawei Fu, Yunlong Song, Yan Wu, Fisher Yu, Davide Scaramuzza
TL;DR
The paper addresses vision-based autonomous drone racing by eliminating the need for global state estimation and trajectory planning through a deep sensorimotor policy learned from raw images. It introduces a two-stage learning-by-cheating framework: a privileged-state teacher trained with full state information via PPO, and a vision-only student that learns to map image embeddings to control commands through imitation, aided by BYOL-style contrastive learning and YOLO-based feature extraction. In Flightmare, the vision-based policy achieves racing performance near the state-based policy and near the time-optimal bound, with strong robustness to visual disturbances and distractors. This work demonstrates the feasibility of image-only control for high-speed drones and points toward real-world transfer and history-based (memory) extensions to remove reliance on partial state inputs.
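The two-stage learning-by-cheating idea can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and not the paper's implementation: the teacher is a fixed linear function standing in for a PPO-trained policy, `embed` stands in for an image encoder, and the student is fit by plain supervised regression on the teacher's commands.

```python
import random


def teacher_policy(state):
    """Hypothetical privileged teacher: a fixed linear map from full state to one command."""
    weights = (0.5, -1.0, 2.0)
    return sum(w * s for w, s in zip(weights, state))


def embed(state):
    """Stand-in for an image encoder; the student sees only this embedding."""
    x, y, z = state
    return [x + y, y - z, z]


def predict(student, features):
    """Linear student policy: command = weights . embedding."""
    return sum(w * f for w, f in zip(student, features))


def train_student(num_steps=5000, lr=0.05, seed=0):
    """Fit the student to the teacher's commands with per-sample gradient steps."""
    rng = random.Random(seed)
    student = [0.0, 0.0, 0.0]
    for _ in range(num_steps):
        state = [rng.uniform(-1.0, 1.0) for _ in range(3)]
        target = teacher_policy(state)  # label from the privileged teacher
        features = embed(state)         # the only input the student receives
        error = predict(student, features) - target
        # Gradient of 0.5 * error**2 with respect to each weight is error * feature.
        student = [w - lr * error * f for w, f in zip(student, features)]
    return student


def imitation_error(student, num_samples=200, seed=1):
    """Mean squared difference between student and teacher commands on fresh states."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        state = [rng.uniform(-1.0, 1.0) for _ in range(3)]
        total += (predict(student, embed(state)) - teacher_policy(state)) ** 2
    return total / num_samples


if __name__ == "__main__":
    print("untrained student error:", imitation_error([0.0, 0.0, 0.0]))
    print("trained student error:  ", imitation_error(train_student()))
```

The point of the sketch is the data flow: the full state is used only to produce the teacher's label, while the student is trained and evaluated on the embedding alone.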
Abstract
Autonomous drones can operate in remote and unstructured environments, enabling various real-world applications. However, the lack of effective vision-based algorithms has been a stumbling block to achieving this goal. Existing systems often require hand-engineered components for state estimation, planning, and control. Such a sequential design involves laborious tuning, human heuristics, and compounding delays and errors. This paper tackles the vision-based autonomous-drone-racing problem by learning deep sensorimotor policies. We use contrastive learning to extract robust feature representations from the input images and leverage a two-stage learning-by-cheating framework for training a neural network policy. The resulting policy directly infers control commands with feature representations learned from raw images, forgoing the need for globally consistent state estimation, trajectory planning, and handcrafted control design. Our experimental results indicate that our vision-based policy can achieve the same level of racing performance as the state-based policy while being robust against different visual disturbances and distractors. We believe this work serves as a stepping stone toward developing intelligent vision-based autonomous systems that control the drone purely from image inputs, like human pilots.
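The TL;DR describes the feature learning as BYOL-style. As a rough sketch of what such an objective typically looks like (an assumption about the general technique, not the paper's code), a BYOL-type loss compares an online network's prediction for one augmented view with a slowly updated target network's projection of another view, without explicit negative pairs. The function names and the value of `tau` below are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is nonzero)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def byol_style_loss(online_prediction, target_projection):
    """Squared distance between unit vectors, equal to 2 - 2 * cosine similarity."""
    p = l2_normalize(online_prediction)
    t = l2_normalize(target_projection)  # treated as a constant (no gradient) in BYOL
    return sum((a - b) ** 2 for a, b in zip(p, t))


def ema_update(target_weights, online_weights, tau=0.99):
    """Move the target weights a small step toward the online weights."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_weights, online_weights)]


if __name__ == "__main__":
    view_a = [1.0, 0.0]   # online prediction for one augmented view
    view_b = [2.0, 0.0]   # target projection for another view, same direction
    print(byol_style_loss(view_a, view_b))
    print(ema_update([0.0, 0.0], [1.0, 1.0], tau=0.9))
```

The loss depends only on the angle between the two vectors, so it is smallest when the two views of the same image map to the same direction in feature space, which is the property that makes the features robust to the augmentations used.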
