Curriculum-based Sample Efficient Reinforcement Learning for Robust Stabilization of a Quadrotor

Fausto Mauricio Lagos Suarez, Akshit Saradagi, Vidya Sumathy, Shruti Kotpaliwar, George Nikolakopoulos

TL;DR

This paper addresses the challenge of sample-efficient reinforcement learning for robust quadrotor stabilization from random initial states and disturbances. It introduces a three-stage curriculum combined with an end-to-end PPO policy and a compounded reward $R(t)$ that encodes target, exploration, stability, and navigation terms, e.g., $R(t)=25-20T_e-100E+20S-18w_e$. The curriculum progressively increases task difficulty across fixed hovering, randomized poses, and randomized velocities, with knowledge transfer between stages. In physics-based simulations, the approach yields superior performance and faster convergence than a single-stage baseline, demonstrating strong stabilization and disturbance rejection, while also highlighting ongoing challenges for sim-to-real transfer and curriculum automation.
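The example reward quoted above can be written out as a small function, which makes the relative weighting of the terms easier to see. This is only a sketch: the summary does not define how $T_e$, $E$, $S$, and $w_e$ are computed, so they are treated here as precomputed scalars, and the names `RewardTerms` and `compound_reward` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardTerms:
    """Per-step scalars appearing in R(t). How each one is computed is not
    stated in the summary, so they are taken as given inputs (assumption)."""
    T_e: float
    E: float
    S: float
    w_e: float


def compound_reward(terms: RewardTerms) -> float:
    """Evaluate R(t) = 25 - 20*T_e - 100*E + 20*S - 18*w_e."""
    return (
        25.0
        - 20.0 * terms.T_e
        - 100.0 * terms.E
        + 20.0 * terms.S
        - 18.0 * terms.w_e
    )


if __name__ == "__main__":
    # With every term at zero, only the constant offset remains.
    print(compound_reward(RewardTerms(T_e=0.0, E=0.0, S=0.0, w_e=0.0)))
```

With coefficients of this size, a unit change in $E$ moves the reward five times as much as a unit change in $T_e$ or $S$, provided the terms are on comparable scales.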

Abstract

This article introduces a curriculum learning approach to develop a reinforcement learning-based robust stabilizing controller for a Quadrotor that meets predefined performance criteria. The learning objective is to reach desired positions from random initial conditions while adhering to both transient and steady-state performance specifications. This objective is challenging for conventional one-stage end-to-end reinforcement learning because of the strong coupling between position and orientation dynamics, the complexity of designing and tuning the reward function, and poor sample efficiency, which demands substantial computational resources and leads to extended convergence times. To address these challenges, this work decomposes the learning objective into a three-stage curriculum that incrementally increases task complexity. The curriculum begins with learning to achieve stable hovering from a fixed initial condition, then progressively introduces randomization in initial positions, orientations and velocities. A novel additive reward function is proposed to incorporate transient and steady-state performance specifications. The results demonstrate that the Proximal Policy Optimization (PPO)-based curriculum learning approach, coupled with the proposed reward structure, achieves superior performance compared to a single-stage PPO-trained policy with the same reward function, while significantly reducing computational resource requirements and convergence time. The curriculum-trained policy's performance and robustness are validated under random initial conditions and in the presence of disturbances.
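The three-stage structure described in the abstract can be sketched as a list of stage settings plus a loop that hands the policy from one stage to the next. This is an illustration under stated assumptions, not the authors' training code: the `Stage` fields, the stage names, and the `train_stage` callback are hypothetical, and it is assumed that the third stage keeps the pose randomization of the second.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Sequence


@dataclass(frozen=True)
class Stage:
    name: str
    randomize_pose: bool      # random initial position and orientation
    randomize_velocity: bool  # random initial linear and angular velocity


# Assumed mapping of the abstract's description onto three stages.
CURRICULUM = (
    Stage("hover_from_fixed_start", randomize_pose=False, randomize_velocity=False),
    Stage("random_initial_pose", randomize_pose=True, randomize_velocity=False),
    Stage("random_initial_velocity", randomize_pose=True, randomize_velocity=True),
)


def run_curriculum(
    train_stage: Callable[[Stage, Optional[Any]], Any],
    stages: Sequence[Stage] = CURRICULUM,
    policy: Optional[Any] = None,
) -> Any:
    """Train the stages in order, warm-starting each from the previous policy.

    `train_stage(stage, policy)` is a user-supplied function (hypothetical)
    that runs PPO on the given stage and returns the updated policy.
    """
    for stage in stages:
        policy = train_stage(stage, policy)
    return policy
```

Passing the previous policy into each stage is what implements the knowledge transfer. A single-stage baseline corresponds to calling `train_stage` once on the last stage with `policy=None`.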

Paper Structure

This paper contains 10 sections, 5 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: The Crazyflie Quadrotor.
  • Figure 2: Reinforcement Learning setup and configuration of the Neural Networks. The RL agent (actor network) interacts with the training environment by generating actions $a_t$ based on the current state $s_t$ provided by the environment. The actor network outputs motor RPM values for the Quadrotor's four motors. The critic network estimates the value function $V(s_t)$ to evaluate the state. The experience tuple $\langle s_t, a_t, r_{t+1}, s_{t+1} \rangle$ is used to update both actor and critic networks through PPO's objective function, including the clipped surrogate loss, $L_{CLIP}(\theta)$. The state of the Quadrotor includes position, orientation, and linear and angular velocities.
  • Figure 3: Episode Cumulative Reward (ECR) comparison between curriculum learning (sub-tasks 1, 2, and 3) and single-stage training (orange). After 20 million time steps (equivalent to 6.8 hours), the single-stage training remains unstable, failing to achieve higher cumulative rewards and demonstrating poor performance on the target task. In contrast, the curriculum learning approach progressively learns to achieve the target task. Sub-task 1 quickly achieves higher cumulative rewards, indicating excellent performance in learning to hover. This knowledge is then effectively transferred to the subsequent tasks (Sub-tasks 2 and 3), which naturally exhibit lower cumulative rewards due to the increased task difficulty introduced by random initial conditions and episode truncations.
  • Figure 4: Evaluation of the curriculum-trained policy using 30 trials, with initial positions chosen from three regions. At the starting position of the Quadrotor, we use the body coordinate frame to show variations in the initial attitude and a velocity vector (black arrow) to represent the randomized initial linear velocity (the length of the arrow is proportional to the magnitude of the velocity). The Quadrotor is also initialized with random non-zero angular velocities. Solid trajectories represent smooth maneuvers with small transients, while dashed trajectories are used to indicate large transients, where the drone gets close to the ground, to compensate for large unfavorable initial velocities. In all tests, the drone successfully reaches the target position.
  • Figure 5: Performance of the trained policy in achieving robust stabilization from 30 randomized initial conditions. The plots show the evolution of the position (top left), orientation (top right), linear velocities (bottom left), and angular velocities (bottom right) of the Quadrotor. Trajectories with initialization in the three regions A, B and C are shown in different colors: inner cylinder (Region A, blue), annular region (Region B, green), and outer cylinder (Region C, orange). Dashed traces indicate large transients, where the drone reaches close to the ground (seen from the evolution of the $z$ state), to overcome large initial velocities. The black dashed line represents the reference target values. The curriculum-trained policy consistently drives the system to the target position, from a diverse set of initial states.
  • ...and 2 more figures
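
The clipped surrogate loss named in the Figure 2 caption is the standard PPO objective: for each sample it takes the minimum of the unclipped and clipped probability-ratio terms, each multiplied by the advantage estimate. Below is a minimal pure-Python sketch of that computation. The function name `clipped_surrogate` and the default `eps=0.2` are assumptions for illustration; the clip range used in the paper is not stated here.

```python
from typing import Sequence


def clipped_surrogate(
    ratios: Sequence[float],
    advantages: Sequence[float],
    eps: float = 0.2,  # assumed clip range, not taken from the paper
) -> float:
    """Mean over samples of min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    `ratios` holds the probability ratios pi_new(a|s) / pi_old(a|s) and
    `advantages` the matching advantage estimates. PPO maximizes this value.
    """
    if len(ratios) != len(advantages):
        raise ValueError("ratios and advantages must have the same length")
    if len(ratios) == 0:
        raise ValueError("at least one sample is required")

    total = 0.0
    for r, a in zip(ratios, advantages):
        clipped_r = min(max(r, 1.0 - eps), 1.0 + eps)
        total += min(r * a, clipped_r * a)
    return total / len(ratios)
```

Taking the minimum removes the incentive to push the ratio beyond the clip range in the direction the advantage favors, which keeps each policy update close to the policy that collected the data.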