Accelerating Goal-Conditioned RL Algorithms and Research

Michał Bortkiewicz; Władysław Pałucki; Vivek Myers; Tadeusz Dziarmaga; Tomasz Arczewski; Łukasz Kuciński; Benjamin Eysenbach

TL;DR

The paper introduces JaxGCRL, a fast GPU-accelerated benchmark and codebase for self-supervised goal-conditioned reinforcement learning (GCRL). By combining GPU-accelerated simulators, a stable contrastive RL algorithm, and a suite of discrete, state-based tasks, it delivers up to $22\times$ faster training (e.g., $10$M steps in minutes on a single GPU) and enables rapid, iterative experimentation. The authors systematically study design choices in contrastive learning—energy functions, losses, and architecture scaling—and demonstrate robust performance across eight GCRL environments, with InfoNCE-based objectives and $L2$ energy often performing best in data-rich regimes. They also show that large architectures with layer normalization further boost performance and that broad data and architecture scaling can be achieved efficiently, highlighting CRL as a viable path for scalable self-supervised RL research. Overall, JaxGCRL lowers barriers to entry, accelerates hypothesis testing, and lays groundwork for future advances in self-supervised GCRL with broad practical impact.

Abstract

Self-supervision has the potential to transform reinforcement learning (RL), paralleling the breakthroughs it has enabled in other areas of machine learning. While self-supervised learning in other domains aims to find patterns in a fixed dataset, self-supervised goal-conditioned reinforcement learning (GCRL) agents discover new behaviors by learning from the goals achieved during unstructured interaction with the environment. However, these methods have failed to see similar success, both due to a lack of data from slow environment simulations as well as a lack of stable algorithms. We take a step toward addressing both of these issues by releasing a high-performance codebase and benchmark (JaxGCRL) for self-supervised GCRL, enabling researchers to train agents for millions of environment steps in minutes on a single GPU. By utilizing GPU-accelerated replay buffers, environments, and a stable contrastive RL algorithm, we reduce training time by up to $22\times$. Additionally, we assess key design choices in contrastive RL, identifying those that most effectively stabilize and enhance training performance. With this approach, we provide a foundation for future research in self-supervised GCRL, enabling researchers to quickly iterate on new ideas and evaluate them in diverse and challenging environments. Website + Code: https://github.com/MichalBortkiewicz/JaxGCRL
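The TL;DR singles out InfoNCE-based objectives with an $L2$ energy as a strong choice for the contrastive critic. The following is a minimal JAX sketch of that combination, not the JaxGCRL implementation: the function names `l2_energy` and `infonce_loss`, the batch layout (row $i$ of each array forms a positive pair), and the small epsilon inside the square root are all assumptions made for illustration.

```python
import jax
import jax.numpy as jnp


def l2_energy(sa_repr, g_repr):
    """Pairwise energy: negative Euclidean distance between every
    state-action representation and every goal representation.

    sa_repr: (B, D) array, g_repr: (B, D) array -> (B, B) array.
    """
    diff = sa_repr[:, None, :] - g_repr[None, :, :]
    # Epsilon (an assumption) keeps the gradient finite at zero distance.
    return -jnp.sqrt(jnp.sum(diff ** 2, axis=-1) + 1e-12)


def infonce_loss(sa_repr, g_repr):
    """InfoNCE over a batch: entry (i, i) is the positive pair and the
    other goals in row i act as negatives."""
    logits = l2_energy(sa_repr, g_repr)             # (B, B)
    log_probs = jax.nn.log_softmax(logits, axis=1)  # normalize over goals
    return -jnp.mean(jnp.diag(log_probs))


# Hypothetical usage with random representations.
key_sa, key_g = jax.random.split(jax.random.PRNGKey(0))
sa = jax.random.normal(key_sa, (8, 16))
g = jax.random.normal(key_g, (8, 16))
loss, grads = jax.value_and_grad(infonce_loss)(sa, g)
```

When every representation is identical the loss equals $\log B$ (chance level), and it approaches zero as each state-action representation moves close to its own goal and far from the others.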

Paper Structure (52 sections, 14 equations, 14 figures, 2 tables, 1 algorithm)


Introduction
Related Work
Goal-Conditioned Reinforcement Learning
Accelerating Deep Reinforcement Learning
Self-Supervised RL
RL Benchmarks
Preliminaries
Contrastive Critic Learning
Policy Learning
JaxGCRL: A New Benchmark and Implementation
JaxGCRL speedup on a single GPU
JaxGCRL Environments in the Benchmark
Reacher [brockman2022openai]
Half Cheetah [wawrzynskiCatLikeRobotRealTime2009]
Pusher [brockman2022openai]
...and 37 more sections

Figures (14)

Figure 1: JaxGCRL is fast. It learns goal-reaching policies for Ant in 10 minutes on 1 GPU. This paper releases a GCRL benchmark and baseline algorithms that enable research and experiments to be done in minutes.
Figure 2: JaxGCRL benchmark: New suite of GPU-accelerated environments for studying GCRL. In this setting, the agent does not receive any rewards or demonstrations, making some of these tasks an excellent testbed for studying exploration and long-horizon reasoning. Our accompanying implementation of GCRL algorithms trains with more than $15$K environment steps per second on a single GPU, enabling rapid experimentation.
Figure 3: Baseline results in JaxGCRL benchmark. Success rates of all baseline algorithms over $50$M environment steps in every JaxGCRL environment. CRL outperforms the other baselines in most environments. Training speed depends on environment complexity, method complexity, and physics backend; see the benchmark performance appendix. Speed also varies widely between methods within the same environment: PPO is significantly faster than the others because it does not use a replay buffer, which frees GPU memory for more parallel environment simulations. Results are reported as the interquartile mean (IQM) along with its standard error, based on 10 seeds.
Figure 4: InfoNCE-based loss functions perform best. The critic loss functions that achieve the highest success rates are based on InfoNCE and DPO. However, DPO policies tend to stay at the goal for a shorter duration. IQMs averaged over 10 seeds and plotted with one standard error.
Figure 5: Scaling the critic and actor networks. Increasing the width and depth generally enhances performance, but performance levels off for deeper architectures at a width of $1024$. Aggregated metrics, 5 seeds per configuration.
...and 9 more figures
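
The captions for Figures 3 and 4 report results as the interquartile mean (IQM) over seeds. As a rough illustration of that statistic, the sketch below sorts per-seed scores, drops the lowest and highest quarter, and averages the rest. The function name `interquartile_mean` and the rounding rule for the cut (truncating 25% of the count toward zero) are assumptions; the exact procedure the authors used is not stated here.

```python
import jax.numpy as jnp


def interquartile_mean(scores):
    """Mean of the middle ~50% of the values in a 1-D array of scores."""
    x = jnp.sort(jnp.asarray(scores, dtype=jnp.float32))
    n = x.shape[0]
    cut = int(0.25 * n)  # assumption: truncate, e.g. 10 seeds -> drop 2 per side
    return jnp.mean(x[cut:n - cut])


# Hypothetical per-seed success rates for one environment (10 seeds).
seed_scores = [0.0, 0.0, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.0, 1.0]
iqm = interquartile_mean(seed_scores)
```

Because the extremes are discarded, a single failed or unusually lucky seed shifts the IQM far less than it would shift the plain mean.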
