RAPTOR: A Foundation Policy for Quadrotor Control

Jonas Eschmann, Dario Albani, Giuseppe Loianno

Abstract

Humans are remarkably data-efficient when adapting to new unseen conditions, like driving a new car. In contrast, modern robotic control systems, like neural network policies trained using Reinforcement Learning (RL), are highly specialized for single environments. Because of this overfitting, they are known to break down even under small differences like the Simulation-to-Reality (Sim2Real) gap and require system identification and retraining for even minimal changes to the system. In this work, we present RAPTOR, a method for training a highly adaptive foundation policy for quadrotor control. Our method enables training a single, end-to-end neural-network policy to control a wide variety of quadrotors. We test 10 different real quadrotors from 32 g to 2.4 kg that also differ in motor type (brushed vs. brushless), frame type (soft vs. rigid), propeller type (2/3/4-blade), and flight controller (PX4/Betaflight/Crazyflie/M5StampFly). We find that a tiny, three-layer policy with only 2084 parameters is sufficient for zero-shot adaptation to a wide variety of platforms. The adaptation through in-context learning is made possible by using a recurrence in the hidden layer. The policy is trained through our proposed Meta-Imitation Learning algorithm, where we sample 1000 quadrotors and train a teacher policy for each of them using RL. Subsequently, the 1000 teachers are distilled into a single, adaptive student policy. We find that within milliseconds, the resulting foundation policy adapts zero-shot to unseen quadrotors. We extensively test the capabilities of the foundation policy under numerous conditions (trajectory tracking, indoor/outdoor, wind disturbance, poking, different propellers).
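To give a feel for how small a three-layer recurrent policy with 2084 parameters is, the sketch below counts the parameters of one hypothetical layout: a dense input layer, a GRU hidden layer, and a dense output layer. The abstract does not state the layer sizes, so everything here is an assumption: a 22-dimensional observation, 16 hidden units, a GRU with two bias vectors per gate plus a learnable initial hidden state, and 4 motor outputs. These values were chosen only because they happen to add up to the reported count; the actual architecture may differ.

```python
# Hypothetical sizes (assumptions, not given in the abstract).
OBS_DIM = 22      # assumed observation size
HIDDEN_DIM = 16   # assumed width of the input layer and the recurrent layer
ACTION_DIM = 4    # one command per motor of a quadrotor


def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def gru_params(n_in: int, n_hidden: int, learn_initial_state: bool = True) -> int:
    """Parameters of a GRU layer with three gates (reset, update, candidate).

    Each gate has input weights, recurrent weights, and two bias vectors
    (one for the input path, one for the recurrent path). Optionally the
    initial hidden state is a learned vector as well (an assumption here).
    """
    per_gate = n_in * n_hidden + n_hidden * n_hidden + 2 * n_hidden
    total = 3 * per_gate
    if learn_initial_state:
        total += n_hidden
    return total


def policy_params(obs_dim: int = OBS_DIM,
                  hidden_dim: int = HIDDEN_DIM,
                  action_dim: int = ACTION_DIM) -> int:
    """Total parameter count of the assumed dense -> GRU -> dense policy."""
    return (dense_params(obs_dim, hidden_dim)
            + gru_params(hidden_dim, hidden_dim)
            + dense_params(hidden_dim, action_dim))


if __name__ == "__main__":
    print("input layer :", dense_params(OBS_DIM, HIDDEN_DIM))     # 368
    print("GRU layer   :", gru_params(HIDDEN_DIM, HIDDEN_DIM))    # 1648
    print("output layer:", dense_params(HIDDEN_DIM, ACTION_DIM))  # 68
    print("total       :", policy_params())                       # 2084
```

Under these assumptions the recurrent layer holds most of the parameters (1648 of 2084), which is consistent with the abstract's point that the recurrence in the hidden layer is what carries the in-context adaptation.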

Paper Structure

This paper contains 9 sections, 18 equations, 9 figures.

Figures (9)

  • Figure 1: (A) Motivation. Comparison of the adaptation capabilities of humans, contemporary RL-based control policies, and our RAPTOR method. (B) The RAPTOR Method. Overview of all stages of the RAPTOR architecture.
  • Figure 2: Training Results. (A) shows the pre-training learning curve, (B) shows the meta-imitation learning curve where the policy is evaluated using a validation set of $7$ quadrotors that are not seen during training, (C) shows the Pareto frontier between performance and number of teachers, and (D) shows the Pareto frontier between performance and student/foundation policy size.
  • Figure 3: Inference Results. Here we show a recovery of a simulated quadrotor from an adverse initial condition using the trained foundation policy. We show the latent state of the policy throughout the trajectory and test if it performs emergent/implicit system identification by training a linear probe.
  • Figure 4: Test Quadrotors. A diverse set of $10$ real and $2$ simulated quadrotors that we use in the experiments.
  • Figure 5: Trajectory Tracking Results. Trajectory tracking results of the $10$ real and $2$ simulation quadrotors.
  • ...and 4 more figures