DISCOVER: Automated Curricula for Sparse-Reward Reinforcement Learning

Leander Diaz-Bone, Marco Bagatella, Jonas Hübotter, Andreas Krause

TL;DR

Sparse-reward RL requires efficient long-horizon exploration and credit assignment. DISCOVER introduces a directed, goal-conditioned curriculum that selects intermediate goals by balancing achievability, novelty, and relevance using an ensemble of critics, with online adaptation of its parameters. The approach yields a UCB-inspired theoretical guarantee on the time to achieve the target and empirically outperforms state-of-the-art exploration strategies on high-dimensional tasks such as AntMaze, Arm, and PointMaze, with further gains from priors and subgoal learning. This directed exploration framework enables solving substantially harder tasks and points to future directions in goal generation, hierarchical planning, and cross-task knowledge reuse.

Abstract

Sparse-reward reinforcement learning (RL) can model a wide range of highly complex tasks. Solving sparse-reward tasks is RL's core premise, requiring efficient exploration coupled with long-horizon credit assignment, and overcoming these challenges is key for building self-improving agents with superhuman ability. Prior work commonly explores with the objective of solving many sparse-reward tasks, making exploration of individual high-dimensional, long-horizon tasks intractable. We argue that solving such challenging tasks requires solving simpler tasks that are relevant to the target task, i.e., tasks whose achievement teaches the agent skills required for solving the target task. We demonstrate that this sense of direction, necessary for effective exploration, can be extracted from existing RL algorithms, without leveraging any prior information. To this end, we propose a method for directed sparse-reward goal-conditioned very long-horizon RL (DISCOVER), which selects exploratory goals in the direction of the target task. We connect DISCOVER to principled exploration in bandits, formally bounding the time until the target task becomes achievable in terms of the agent's initial distance to the target, but independent of the volume of the space of all tasks. We then perform a thorough evaluation in high-dimensional environments. We find that the directed goal selection of DISCOVER solves exploration problems that are beyond the reach of prior state-of-the-art exploration methods in RL.
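To make the idea of directed goal selection concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each previously achieved goal has ensemble estimates of two values: how reachable the goal is from the start state, and how close the goal is to the target. It further assumes that the two optimistic (mean plus `beta` times standard deviation) estimates are mixed with a weight `alpha`. The function names, the exact form of the score, and the input layout are hypothetical and chosen only for illustration.

```python
from statistics import fmean, pstdev
from typing import Sequence


def optimistic_value(ensemble_values: Sequence[float], beta: float) -> float:
    """UCB-style estimate: ensemble mean plus beta times ensemble std.

    The mean stands in for achievability or relevance, and the ensemble
    disagreement stands in for novelty (an assumption of this sketch).
    """
    return fmean(ensemble_values) + beta * pstdev(ensemble_values)


def select_goal(
    v_start: Sequence[Sequence[float]],
    v_target: Sequence[Sequence[float]],
    alpha: float,
    beta: float,
) -> int:
    """Return the index of the highest-scoring candidate goal.

    v_start[i]  : ensemble estimates of the value of reaching goal i
                  from the start state (achievability).
    v_target[i] : ensemble estimates of the value of reaching the
                  target from goal i (relevance).
    alpha       : weight on achievability versus relevance.
    beta        : weight on the ensemble-disagreement bonus.
    """
    scores = [
        alpha * optimistic_value(vs, beta)
        + (1.0 - alpha) * optimistic_value(vt, beta)
        for vs, vt in zip(v_start, v_target)
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

In this sketch, a small `alpha` favors goals close to the target, a large `alpha` favors goals the agent can already reach reliably, and a large `beta` favors goals on which the ensemble disagrees.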

Paper Structure

This paper contains 43 sections, 5 theorems, 30 equations, 16 figures, 2 tables, 1 algorithm.

Key Result

Proposition C.6

Let Assumptions asm:model, asm:feedback, and asm:estimates hold. Fix any $\delta \in (0,1)$, $n \geq 1$, $\alpha \in (0,1)$, and let … We then have with probability $1-\delta$ that the regret of selecting goals $g_n, g_{n+1}, \dots$ with DISCOVER($\alpha, \beta_t$) is bounded by …

Figures (16)

  • Figure 1: Given a hard task, we compare agents learning to solve this target task by learning from the experience on simpler exploratory tasks. DISCOVER uses a bootstrapped sense of direction to design a curriculum of achievable and novel exploratory tasks that are relevant to the target task. In this way, the agent bootstraps to solve much harder tasks than if using other methods for selecting exploratory tasks, such as considering only direction or only achievability and novelty. State-of-the-art standard RL algorithms using intrinsic curiosity for exploration fail to achieve our target tasks at all.
  • Figure 2: Illustration of the goal selection of DISCOVER compared to prior goal selection strategies. The white cross represents the initial state of the agent, the red cross represents the target goal. The blue shaded area symbolizes the set of achieved goals $\mathcal{G}_\mathrm{ach}$. A lighter blue corresponds to harder to reach goals. Finally, the black crosses represent the kinds of goals selected by each strategy.
  • Figure 3: Comparison of the success rates on the target task over the course of training in the pointmaze, antmaze & arm environments. We compare DISCOVER to other strategies for goal selection. We consider two difficulty levels for each environment. We find that the DISCOVER agents learn to solve difficult target tasks significantly faster than the baselines.
  • Figure 4: Visualization of the selected goals of different goal selection strategies during the first 25M steps on the antmaze environment, colored by time step. DISCOVER balances exploring the environment with exploiting the agent's sense of direction to select goals relevant to the target task.
  • Figure 5: Comparison of using different strategies to determine direction, which replace the $V(g,g^\star)$ term in the DISCOVER objective. Hand-designed direction: $\|g-g^\star\|_2$; pre-trained direction: critic from training in a pointmaze environment with the same maze layout.
  • ...and 11 more figures
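The hand-designed direction mentioned in the caption of Figure 5 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: it assumes the learned direction term is replaced by the negative Euclidean distance between a candidate goal and the target, so that closer goals score higher, and that this term is mixed with a reachability estimate using a weight `alpha`. The function names and the `reach_values` input are hypothetical.

```python
import math
from typing import Sequence


def hand_designed_direction(goal: Sequence[float], target: Sequence[float]) -> float:
    """Negative Euclidean distance to the target (sign is an assumption)."""
    return -math.dist(goal, target)


def select_goal_by_distance(
    achieved_goals: Sequence[Sequence[float]],
    target: Sequence[float],
    reach_values: Sequence[float],
    alpha: float,
) -> int:
    """Pick the achieved goal with the best mix of reachability and direction.

    reach_values[i] is a hypothetical estimate of how easily goal i is
    reached from the start state; alpha weights it against the direction term.
    """
    scores = [
        alpha * reach + (1.0 - alpha) * hand_designed_direction(goal, target)
        for goal, reach in zip(achieved_goals, reach_values)
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

A distance-based direction ignores obstacles such as maze walls, which is one reason a learned or pre-trained critic may give a better sense of direction in maze environments.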

Theorems & Definitions (10)

  • Proposition C.6
  • proof
  • Lemma C.7
  • proof
  • Lemma C.8: Improvement lemma
  • proof
  • Theorem C.9
  • proof
  • Lemma C.10
  • proof