Self-Supervised Curriculum Generation for Autonomous Reinforcement Learning without Task-Specific Knowledge

Sang-Hyun Lee, Seung-Woo Seo

TL;DR

This paper proposes a novel autonomous reinforcement learning (ARL) algorithm that generates a curriculum adaptive to the agent's learning progress without task-specific knowledge. It introduces a success discriminator that estimates the success probability from each initial state when the agent follows the forward policy.

Abstract

A significant bottleneck in applying current reinforcement learning algorithms to real-world scenarios is the need to reset the environment between every episode. This reset process demands substantial human intervention, making it difficult for the agent to learn continuously and autonomously. Several recent works have introduced autonomous reinforcement learning (ARL) algorithms that generate curricula for jointly training reset and forward policies. While their curricula can reduce the number of required manual resets by taking into account the agent's learning progress, they rely on task-specific knowledge, such as predefined initial states or reset reward functions. In this paper, we propose a novel ARL algorithm that can generate a curriculum adaptive to the agent's learning progress without task-specific knowledge. Our curriculum empowers the agent to autonomously reset to diverse and informative initial states. To achieve this, we introduce a success discriminator that estimates the success probability from each initial state when the agent follows the forward policy. The success discriminator is trained with relabeled transitions in a self-supervised manner. Our experimental results demonstrate that our ARL algorithm can generate an adaptive curriculum and enable the agent to efficiently bootstrap to solve sparse-reward maze navigation and manipulation tasks, outperforming baselines with significantly fewer manual resets.

Paper Structure

This paper contains 12 sections, 2 equations, 9 figures, 2 tables, 1 algorithm.

Figures (9)

  • Figure 1: Overview of our ARL algorithm. Our algorithm generates a curriculum adaptive to the learning progress of the forward policy without task-specific knowledge. To identify the initial states that enable the agent to acquire diverse and informative transitions, we introduce the success discriminator $C(s, a)$ trained with relabeled transitions in a self-supervised manner. The subset of initial states with an estimated success probability below $\lambda_1$ is represented as the red-shaded area, while the subset of initial states with an estimated success probability above $\lambda_2$ is represented as the green-shaded area. The goal is represented as the purple star. The rollouts from the forward policy are colored to indicate whether the agent reaches the goal or not.
  • Figure 2: Maze navigation tasks introduced in our work. The locations of the goals correspond to the positions of the agents in these snapshots. The routes for each goal are represented as red lines. These tasks require the agent to reach the goals from diverse initial states without access to extrinsic reset.
  • Figure 3: Manipulation tasks used in our work. These snapshots represent the target pose of a three-fingered hand robot and the target orientation of a three-pronged valve in each task.
  • Figure 4: Learning curves for maze navigation tasks. The x-axis represents the number of training steps and the y-axis represents one of the metrics used in our experiments. The darker-colored lines and shaded areas denote the means and standard deviations over 10 random seeds, respectively. These results imply that our algorithm consistently achieves more robust performance and better sample efficiency than state-of-the-art ARL algorithms on all tasks.
  • Figure 5: Initial states allowed by our adaptive curriculum on maze2d-2way-v1. We normalize each dimension of states to [0,1] and their colors denote the success probabilities estimated by the success discriminator. Our adaptive curriculum allows initial states to be generated only near the goal at (0.1, 0.1) during the early stages of training. As training progresses, it also allows initial states to be generated from locations increasingly distant from the goal.
  • ...and 4 more figures
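
The Figure 1 caption describes splitting candidate initial states by two thresholds on the estimated success probability: below $\lambda_1$ (red-shaded) and above $\lambda_2$ (green-shaded). A minimal sketch of that partition follows, under several assumptions that are not from the paper: the function names, the state-only stand-in `toy_success_prob` (the paper's discriminator is written $C(s, a)$ and is learned), and the threshold values 0.2 and 0.8 are all hypothetical. The sketch also keeps the states between the two thresholds as a third group. The captions shown here do not say how the curriculum treats that band, so reading it as the informative region is an assumption.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

State = Tuple[float, float]  # 2-D state normalized to [0, 1], as in Figure 5


def toy_success_prob(state: State, goal: State = (0.1, 0.1)) -> float:
    """Hypothetical stand-in for a learned success discriminator.

    Returns a value in [0, 1] that decays linearly with Euclidean distance
    from the goal. The real discriminator is trained, not hand-written.
    """
    distance = math.hypot(state[0] - goal[0], state[1] - goal[1])
    return max(0.0, 1.0 - distance)


def partition_initial_states(
    states: Sequence[State],
    success_prob: Callable[[State], float],
    lambda_1: float,
    lambda_2: float,
) -> Dict[str, List[State]]:
    """Group candidate initial states by estimated success probability."""
    if not 0.0 <= lambda_1 <= lambda_2 <= 1.0:
        raise ValueError("expected 0 <= lambda_1 <= lambda_2 <= 1")

    groups: Dict[str, List[State]] = {
        "below_lambda_1": [],  # red-shaded area in Figure 1
        "between": [],         # assumed informative band (not stated in the captions)
        "above_lambda_2": [],  # green-shaded area in Figure 1
    }
    for state in states:
        p = success_prob(state)
        if p < lambda_1:
            groups["below_lambda_1"].append(state)
        elif p > lambda_2:
            groups["above_lambda_2"].append(state)
        else:
            groups["between"].append(state)
    return groups


# Illustrative thresholds only; the paper's values are not given here.
candidates = [(0.1, 0.1), (0.5, 0.5), (0.9, 0.9)]
groups = partition_initial_states(candidates, toy_success_prob, 0.2, 0.8)
```

With this toy discriminator, the state at the goal falls in the high-probability group and the far corner falls in the low-probability group. As a learned discriminator's estimates rise for more distant states, the same thresholds would admit states farther from the goal, which is consistent with the progression described for Figure 5.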