A Single Goal is All You Need: Skills and Exploration Emerge from Contrastive RL without Rewards, Demonstrations, or Subgoals

Grace Liu; Michael Tang; Benjamin Eysenbach

TL;DR

The paper addresses exploration in long-horizon, sparse-reward reinforcement learning and demonstrates that skills and directed exploration can emerge from a simple single-goal contrastive RL (CRL) approach. Data collection always commands a single fixed target state $s^*$, while the actor is trained with multiple goals, using an InfoNCE-based contrastive objective to learn representations that align actions with the goal. Empirically, the method yields emergent, increasingly complex manipulation skills, diverse strategies across seeds, and robust performance across four challenging tasks. It often outperforms subgoal curricula and dense-reward baselines while using no rewards, demonstrations, or extra hyperparameters. The authors acknowledge a lack of theoretical understanding and propose future work on broader domains and alternative exploration signals, highlighting the potential of single-goal CRL to simplify and scale exploration in RL.
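To make the InfoNCE-based objective mentioned above concrete, here is a minimal sketch in plain Python. It assumes an inner-product critic between a state-action representation and a goal representation, with the other goals in the batch serving as negatives. The function name `info_nce_loss` and both argument names are hypothetical; the paper's actual critic architecture and loss details may differ.

```python
import math

def info_nce_loss(sa_repr, goal_repr):
    """Mean InfoNCE loss over a batch (illustrative sketch, not the authors' code).

    sa_repr[i] and goal_repr[i] are assumed to form a positive pair;
    every other goal_repr[j] in the batch acts as a negative for row i.
    """
    losses = []
    for i, phi in enumerate(sa_repr):
        # Assumed critic: inner product between representations.
        logits = [sum(p * q for p, q in zip(phi, psi)) for psi in goal_repr]
        # Numerically stable log-sum-exp over all candidate goals.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching goal as the correct class.
        losses.append(log_norm - logits[i])
    return sum(losses) / len(losses)

aligned = info_nce_loss([[3.0, 0.0], [0.0, 3.0]], [[3.0, 0.0], [0.0, 3.0]])
swapped = info_nce_loss([[3.0, 0.0], [0.0, 3.0]], [[0.0, 3.0], [3.0, 0.0]])
print(aligned < swapped)
```

The loss is small when each state-action representation scores its own goal highest and large when the pairing is swapped, which is the sense in which the objective aligns actions with goals.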

Abstract

In this paper, we present empirical evidence of skills and directed exploration emerging from a simple RL algorithm long before any successful trials are observed. For example, in a manipulation task, the agent is given a single observation of the goal state and learns skills, first for moving its end-effector, then for pushing the block, and finally for picking up and placing the block. These skills emerge before the agent has ever successfully placed the block at the goal location and without the aid of any reward functions, demonstrations, or manually-specified distance metrics. Once the agent has learned to reach the goal state reliably, exploration is reduced. Implementing our method involves a simple modification of prior work and does not require density estimates, ensembles, or any additional hyperparameters. Intuitively, the proposed method seems like it should be terrible at exploration, and we lack a clear theoretical understanding of why it works so effectively, though our experiments provide some hints.
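The "simple modification of prior work" can be illustrated with a toy sketch: every data-collection episode commands the same fixed goal, while training tuples use goals drawn from states the agent actually visited. The 1-D environment, the random placeholder policy, and all names (`SINGLE_GOAL`, `collect_episode`, `sample_training_tuple`) are hypothetical, and relabeling goals with future states of the same trajectory is an assumption about how the multiple training goals are obtained, not a detail confirmed by this page.

```python
import random

SINGLE_GOAL = 10  # hypothetical hard goal on a 1-D line; never varied

def env_step(state, action):
    """Toy deterministic dynamics: move left or right by one."""
    return state + action

def random_policy(state, goal, rng):
    """Placeholder for the learned goal-conditioned actor (ignores its inputs)."""
    return rng.choice([-1, 1])

def collect_episode(policy, goal, horizon, rng):
    """Roll out one episode while always commanding the same goal."""
    states, actions = [0], []
    for _ in range(horizon):
        action = policy(states[-1], goal, rng)
        actions.append(action)
        states.append(env_step(states[-1], action))
    return states, actions

def sample_training_tuple(states, actions, rng):
    """Pick (state, action) at time t and a later visited state as the training goal."""
    t = rng.randrange(len(actions))
    k = rng.randrange(t + 1, len(states))
    return states[t], actions[t], states[k]

rng = random.Random(0)
states, actions = collect_episode(random_policy, SINGLE_GOAL, horizon=20, rng=rng)
batch = [sample_training_tuple(states, actions, rng) for _ in range(8)]
```

In this sketch the commanded goal never changes, yet the training batch contains many different goals, which mirrors the split between single-goal data collection and multi-goal training described in the TL;DR.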

Paper Structure (32 sections, 4 equations, 14 figures, 2 tables, 1 algorithm)

Introduction
Related Work
Rewards and demonstrations.
Exploration and subgoal sampling.
Multi-task learning for single-task problems
Single-Goal Exploration with Contrastive RL
Preliminaries
Notation.
Contrastive RL.
Our Approach
Experiments
Tasks.
Single-goal Exploration is Exceedingly Effective.
A single goal works well.
Early training: agent develops an emergent curriculum of skills.
...and 17 more sections

Figures (14)

Figure 1: Skills and Directed Exploration Emerge. In this bin picking task, we provide the agent with a single goal observation where the green block is in the left bin. The agent never receives any rewards (not even sparse rewards). Throughout the course of training, the agent learns skills that increase in complexity. Easier skills seem to enable the agent to unlock more complex skills: moving the hand is a prerequisite for pushing the object; closing the gripper is a prerequisite for picking up the object, which is a prerequisite for moving the object to the left bin.
Figure 2: Single-goal exploration. (Left) Our method uses a single difficult goal for both data collection and evaluation. It is exceedingly unlikely that a random policy would ever reach this goal. (Right) Typical methods for goal-conditioned RL use a range of different goals for exploration, even if the user only cares about success at reaching a single difficult goal. These different goals can be provided by the user [eysenbach2023contrastive] or generated with a GAN [florensa2018automatic], VAE [nasiriany2019planning], or planning [chane2021goal, savinov2018semi, zhang2021c].
Figure 3: Single goal Exploration is Highly Effective. We compare single hard goal exploration (command the single hard goal in every trial) to "range of difficulties" exploration (sampling uniformly from a human-provided set of easy/medium/hard goals). In each of the four environments, single-goal exploration yields considerably higher success rates, all while being easier for the human user.
Figure 4: Skills and Directed Exploration for Putting a Lid on a Box: This manipulation task contains an open box and a lid. The single fixed goal has the lid placed neatly on top of the center of the box. The images above show skills acquired throughout the course of learning. Note that some skills unlock subsequent skills (e.g., reaching is a prerequisite for picking, which is a prerequisite for placing) while others look like open-ended "play" (flipping the lid over, pushing the lid away from the box).
Figure 5: Skills and Directed Exploration for Peg Insertion: This manipulation task contains a peg and box with a narrow hole; the single fixed goal is a state where the peg is inside the hole. The agent acquires a sequence of increasingly complex skills throughout training, some of which are important for solving the task (e.g., reaching, grasping) while others are more "playful" (e.g., knocking the peg against the box). The agent also learns to recover from mistakes (see Fig. \ref{fig:perturb}).
...and 9 more figures
