Reverse Forward Curriculum Learning for Extreme Sample and Demonstration Efficiency in Reinforcement Learning

Stone Tao, Arth Shukla, Tse-kai Chan, Hao Su

TL;DR

The paper tackles the challenge of sample- and demonstration-efficient reinforcement learning under sparse rewards. It introduces Reverse Forward Curriculum Learning (RFCL), a two-stage approach that first builds a policy from a narrow initial-state distribution via a per-demonstration reverse curriculum and then generalizes to the full initial-state distribution with a forward curriculum, all within an off-policy SAC framework with a Q-ensemble. RFCL demonstrates strong demonstration and sample efficiency across 21 manipulation tasks spanning MetaWorld, Adroit, and ManiSkill2, solving previously intractable tasks with as few as 5 demonstrations. Ablation studies show the critical roles of the per-demonstration reverse curriculum and the forward curriculum, while simulations highlight robustness to demonstration sources. The authors provide open-source code to enable replication and extension, highlighting the method's potential to reduce data requirements in real-world robotic learning.

Abstract

Reinforcement learning (RL) presents a promising framework to learn policies through environment interaction, but often requires an infeasible amount of interaction data to solve complex tasks from sparse rewards. One direction includes augmenting RL with offline data demonstrating desired tasks, but past work often requires a lot of high-quality demonstration data that is difficult to obtain, especially for domains such as robotics. Our approach consists of a reverse curriculum followed by a forward curriculum. Unique to our approach compared to past work is the ability to efficiently leverage more than one demonstration via a per-demonstration reverse curriculum generated via state resets. The result of our reverse curriculum is an initial policy that performs well on a narrow initial state distribution and helps overcome difficult exploration problems. A forward curriculum is then used to accelerate the training of the initial policy to perform well on the full initial state distribution of the task and improve demonstration and sample efficiency. We show how the combination of a reverse curriculum and forward curriculum in our method, RFCL, enables significant improvements in demonstration and sample efficiency compared against various state-of-the-art learning-from-demonstration baselines, even solving previously unsolvable tasks that require high precision and control.
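
The per-demonstration reverse curriculum described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class `DemoCurriculum`, the function `sample_reset`, and the parameters `window`, `threshold`, and `step_back` are hypothetical names and values chosen only for illustration. The sketch assumes each demonstration keeps its own reset index, which starts near the end of the demonstration and moves toward the first state once the policy succeeds reliably from the current reset point.

```python
import random


class DemoCurriculum:
    """Reverse-curriculum bookkeeping for one demonstration (hypothetical helper)."""

    def __init__(self, num_states):
        # Start by resetting to the state just before the demonstration's final state.
        self.start_idx = max(num_states - 2, 0)
        self.recent = []   # success flags of recent episodes from start_idx
        self.done = False  # True once the policy is reliable from the first demo state

    def record(self, success, window=10, threshold=0.8, step_back=1):
        """Log one episode outcome; move the reset point earlier when the policy is reliable."""
        self.recent.append(bool(success))
        self.recent = self.recent[-window:]
        if len(self.recent) < window or sum(self.recent) / window < threshold:
            return  # not enough evidence yet, or success rate too low
        if self.start_idx == 0:
            self.done = True  # reliable from the demonstration's first state
        else:
            self.start_idx = max(self.start_idx - step_back, 0)
            self.recent = []  # success rate must be re-estimated at the new reset point


def sample_reset(curricula, rng):
    """Pick a demonstration uniformly and return (demo_id, state index to reset to)."""
    demo_id = rng.randrange(len(curricula))
    return demo_id, curricula[demo_id].start_idx


if __name__ == "__main__":
    rng = random.Random(0)
    curricula = [DemoCurriculum(num_states=50), DemoCurriculum(num_states=80)]
    demo_id, idx = sample_reset(curricula, rng)
    # In a real training loop the environment would be reset to state `idx` of
    # demonstration `demo_id`, an episode would be rolled out, and its outcome recorded.
    curricula[demo_id].record(success=True)
    print(demo_id, idx)
```

Tracking progress separately for each demonstration means a demonstration that is easy to reverse does not wait for a harder one, which is one plausible reading of why a per-demonstration curriculum helps when more than one demonstration is available.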

Paper Structure (28 sections, 2 equations, 19 figures, 6 tables)

Figures (19)

  • Figure 1: Results over 3 environment suites and the hardest task given fixed compute budgets ranging from 1M to 10M samples and few demonstrations. Our RFCL method drastically outperforms recent approaches like JSRL and RLPD, which are included in the baselines.
  • Figure 2: A simplified view of the reverse and forward curriculum. The blue arrows represent the given demonstration trajectories (2 in this example), starting from an initial state and moving towards the goal marked by a gold star. The area covered by dashed green lines represents the distribution of initial states from which the policy can achieve high return. The area shaded in red represents the most frequently sampled initial states during each stage of the curriculum. The panels from left to right show the progression of the trained policy's ability over the course of curriculum training.
  • Figure 3: Mean success rate of algorithms for each environment suite across all tasks after 1M interaction steps with 5 demonstrations. Results are averaged within environment. Shaded areas represent 95% CIs over 5 seeds. The results show RFCL is significantly more performant and sample efficient compared to baselines. Note that RLPD and JSRL can normally achieve decent results but require many more demonstrations for harder tasks.
  • Figure 5: Mean success rate of algorithms on ManiSkill2 tasks after 2M interaction steps given varying numbers of demonstrations to train on. Error bars represent 95% CIs over 5 seeds. Vertical gray lines indicate the average number of samples until the reverse curriculum completes. "Reverse curriculum only" is RFCL, but instead of using a forward curriculum in stage 2 we sample uniformly from the initial state distribution. "Forward curriculum only" skips stage 1 training entirely.
  • Figure 6: Heatmap of the agent's success rate at each maze cell over the course of training, comparing three kinds of training: None (no curriculum / normal RL), forward curriculum only, and our method applying reverse and forward curricula. The blue arrow is the demonstration provided. The red dot is the goal.
  • ...and 14 more figures