Learning Deep Sensorimotor Policies for Vision-based Autonomous Drone Racing

Jiawei Fu, Yunlong Song, Yan Wu, Fisher Yu, Davide Scaramuzza

TL;DR

The paper tackles vision-based autonomous drone racing with a deep sensorimotor policy learned from raw images, which removes the need for global state estimation and trajectory planning. It introduces a two-stage learning-by-cheating framework. First, a teacher policy with access to privileged full-state information is trained with PPO. Second, a vision-only student learns by imitation to map image embeddings to control commands, aided by BYOL-style contrastive learning and YOLO-based feature extraction. In Flightmare, the vision-based policy achieves racing performance close to that of the state-based policy and to the time-optimal bound, and it remains robust to visual disturbances and distractors. The work demonstrates the feasibility of image-only control for high-speed drones and points toward real-world transfer and history-based (memory) extensions that would remove the remaining reliance on partial state inputs.
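The teacher-to-student distillation step can be illustrated with a toy example. This is a minimal sketch and not the paper's implementation. It assumes a hypothetical fixed linear `teacher_policy` standing in for the PPO-trained teacher, a hypothetical `embed` function standing in for the image encoder, and a linear student trained by plain supervised regression onto the teacher's actions. All names, dimensions, and hyperparameters are illustrative assumptions.

```python
import random

def teacher_policy(state):
    """Hypothetical stand-in for the privileged teacher: a fixed linear map of the full state."""
    return [0.5 * state[0] - 0.2 * state[1], 0.1 * state[0] + 0.3 * state[1]]

def embed(state):
    """Hypothetical stand-in for the image encoder: the student only sees this embedding."""
    return [state[0] + state[1], state[0] - state[1], 1.0]

def student_policy(weights, embedding):
    """Linear student: one weight row per action dimension."""
    return [sum(w * e for w, e in zip(row, embedding)) for row in weights]

def imitation_loss(weights, states):
    """Mean squared error between student and teacher actions."""
    total = 0.0
    for s in states:
        a_teacher = teacher_policy(s)
        a_student = student_policy(weights, embed(s))
        total += sum((x - y) ** 2 for x, y in zip(a_student, a_teacher))
    return total / len(states)

def train_student(states, steps=500, lr=0.05):
    """Full-batch gradient descent on the imitation loss."""
    weights = [[0.0] * 3 for _ in range(2)]
    for _ in range(steps):
        grads = [[0.0] * 3 for _ in range(2)]
        for s in states:
            z = embed(s)
            a_teacher = teacher_policy(s)
            a_student = student_policy(weights, z)
            for i in range(2):
                err = 2.0 * (a_student[i] - a_teacher[i]) / len(states)
                for j in range(3):
                    grads[i][j] += err * z[j]
        for i in range(2):
            for j in range(3):
                weights[i][j] -= lr * grads[i][j]
    return weights

rng = random.Random(0)
states = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(64)]
initial_loss = imitation_loss([[0.0] * 3 for _ in range(2)], states)
student_weights = train_student(states)
final_loss = imitation_loss(student_weights, states)
```

In this toy setting the student can represent the teacher exactly, so the imitation loss falls close to zero. With real images and a learned encoder, the student would generally only approximate the teacher.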

Abstract

Autonomous drones can operate in remote and unstructured environments, enabling various real-world applications. However, the lack of effective vision-based algorithms has been a stumbling block to achieving this goal. Existing systems often require hand-engineered components for state estimation, planning, and control. Such a sequential design involves laborious tuning, human heuristics, and compounding delays and errors. This paper tackles the vision-based autonomous-drone-racing problem by learning deep sensorimotor policies. We use contrastive learning to extract robust feature representations from the input images and leverage a two-stage learning-by-cheating framework for training a neural network policy. The resulting policy directly infers control commands with feature representations learned from raw images, forgoing the need for globally-consistent state estimation, trajectory planning, and handcrafted control design. Our experimental results indicate that our vision-based policy can achieve the same level of racing performance as the state-based policy while being robust against different visual disturbances and distractors. We believe this work serves as a stepping-stone toward developing intelligent vision-based autonomous systems that control the drone purely from image inputs, like human pilots.
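The abstract does not spell out the representation-learning objective. Since the summary describes it as BYOL-style, the following sketch shows the two core ingredients of BYOL as published by Grill et al.: a normalized mean-squared-error loss between the online network's prediction and the target network's projection, and an exponential-moving-average update of the target parameters. The function names and the flat parameter lists are illustrative assumptions, and the encoder networks themselves are omitted.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def byol_loss(online_prediction, target_projection):
    """Squared distance between unit vectors, equal to 2 - 2 * cosine similarity."""
    p = l2_normalize(online_prediction)
    z = l2_normalize(target_projection)
    return sum((a - b) ** 2 for a, b in zip(p, z))

def ema_update(target_params, online_params, tau=0.99):
    """Move target parameters slowly toward the online parameters."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]
```

The loss depends only on the direction of the two vectors, so it is 0 for aligned embeddings and reaches its maximum of 4 for opposite ones. In training, the two inputs would come from two different augmentations of the same image.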

Paper Structure

This paper contains 12 sections, 3 equations, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: Overview of our policy training method. We first train a teacher policy with access to privileged state information using model-free reinforcement learning. This teacher policy is then distilled into a student policy, which is trained to do perception, planning, and control jointly.
  • Figure 2: Contrastive learning framework (Grill et al., 2020).
  • Figure 3: Visualization of data augmentations used during training. Left: no augmentation. Middle: random convolution. Right: random cutout-color.
  • Figure 4: Visualization of trajectories. Left: Circle. Middle: Figure8. Right: SplitS.
  • Figure 5: Success rates of the state-based policy over position drift.
  • ...and 1 more figure
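
The two augmentations named in the Figure 3 caption can be sketched on a small grayscale image stored as a list of rows. This is a rough illustration under assumptions, not the paper's code: the kernel size, the Gaussian kernel weights, the zero padding, the patch size limit, and the single-channel fill value are all hypothetical choices.

```python
import random

def random_convolution(image, rng, kernel_size=3):
    """Filter the image with a randomly drawn kernel, using zero padding at the borders."""
    k = kernel_size // 2
    kernel = [[rng.gauss(0.0, 1.0) for _ in range(kernel_size)]
              for _ in range(kernel_size)]
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(-k, k + 1):
                for dx in range(-k, k + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[dy + k][dx + k] * image[yy][xx]
            out[y][x] = acc
    return out

def random_cutout_color(image, rng, max_size=4):
    """Overwrite one random rectangle with a single random intensity in [0, 1)."""
    h, w = len(image), len(image[0])
    patch_h = rng.randint(1, min(max_size, h))
    patch_w = rng.randint(1, min(max_size, w))
    top = rng.randint(0, h - patch_h)
    left = rng.randint(0, w - patch_w)
    fill = rng.random()
    out = [row[:] for row in image]  # copy so the input is left untouched
    for y in range(top, top + patch_h):
        for x in range(left, left + patch_w):
            out[y][x] = fill
    return out

rng = random.Random(0)
image = [[2.0] * 8 for _ in range(8)]
convolved = random_convolution(image, rng)
cut = random_cutout_color(image, rng)
```

Both functions return a new image of the same size, so they can be chained or applied independently to produce two views of one frame.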