URLB: Unsupervised Reinforcement Learning Benchmark

Michael Laskin, Denis Yarats, Hao Liu, Kimin Lee, Albert Zhan, Kevin Lu, Catherine Cang, Lerrel Pinto, Pieter Abbeel

TL;DR

URLB introduces a unified benchmark for unsupervised RL that combines reward-free pre-training with downstream fine-tuning on 12 tasks across 3 DeepMind Control Suite domains, and releases a unified codebase with eight baselines that share a common optimization backbone. The study shows that none of the baselines solves URLB within the prescribed budgets, highlighting gaps in representation learning, exploration, and fine-tuning strategies. It also finds that longer pre-training does not always improve adaptation and that competence-based methods generally underperform data- and knowledge-based approaches. URLB provides a transparent, reproducible framework to drive progress in unsupervised RL and to guide future research toward more scalable and robust pre-training methods.

Abstract

Deep Reinforcement Learning (RL) has emerged as a powerful paradigm to solve a range of complex yet specific control tasks. Yet training generalist agents that can quickly adapt to new tasks remains an outstanding challenge. Recent advances in unsupervised RL have shown that pre-training RL agents with self-supervised intrinsic rewards can result in efficient adaptation. However, these algorithms have been hard to compare and develop due to the lack of a unified benchmark. To this end, we introduce the Unsupervised Reinforcement Learning Benchmark (URLB). URLB consists of two phases: reward-free pre-training and downstream task adaptation with extrinsic rewards. Building on the DeepMind Control Suite, we provide twelve continuous control tasks from three domains for evaluation and open-source code for eight leading unsupervised RL methods. We find that the implemented baselines make progress but are not able to solve URLB and propose directions for future research.

Paper Structure

This paper contains 22 sections, 11 equations, 8 figures, 5 tables, 1 algorithm.

Figures (8)

  • Figure 1: Unlike supervised RL, which requires reward interaction at every step, unsupervised RL has two phases: (i) reward-free pre-training and (ii) fine-tuning to an extrinsic reward. During phase (i) an agent explores the environment through reward-free interaction. The quality of exploration depends on the intrinsic reward that the agent sets for itself. During phase (ii) the quality of pre-training is evaluated by its adaptation efficiency to a downstream task.
  • Figure 2: The three domains (walker, quadruped, jaco arm) and twelve downstream tasks considered in URLB. The environments include tasks of varying complexity and require an agent pre-trained on a given domain to adapt efficiently to the downstream tasks within that domain.
  • Figure 3: Aggregate results for each algorithm category after pre-training the agent with intrinsic rewards for 2M environment steps and fine-tuning with extrinsic rewards for 100k steps, as described in Sec. \ref{sec:eval}. Scores are normalized by the asymptotic performance on each task (i.e., the performance of DrQ-v2 on pixels and DDPG on states after training for 2M steps), and we show the mean and standard error of each category. Each algorithm is evaluated across ten random seeds. To provide an aggregate view of each algorithm category, the scores are averaged over individual tasks and methods (see Appendix \ref{app:per_domain_results} for detailed results for each algorithm and downstream task). The Random Init baseline represents DrQ-v2 and DDPG trained from a random initialization for 100k steps. Full results can be found in Section \ref{app:per_domain_results}.
  • Figure 4: We display the fine-tuning efficiency as a function of pre-training steps. As in Fig. \ref{fig:main_result}, scores are asymptotically normalized, averaged across tasks and algorithms on a per-category basis, and evaluated over ten seeds. Our expectation is that a longer pre-training phase should lead to more efficient fine-tuning. However, in several cases the empirical evidence goes against this intuition, demonstrating that longer pre-training is not always beneficial. Understanding this shortcoming of current methods is an important direction for future research. Detailed results can be found in Figures \ref{fig:all_pretraining_steps_states} and \ref{fig:all_pretraining_steps_pixels}.
  • Figure 5: Individual results of fine-tuning for 100k steps after different degrees of pre-training for each considered method. The performance is aggregated across all the tasks within a domain and normalized with respect to the optimal performance.
  • ...and 3 more figures
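
The normalization and aggregation described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names, the dictionary layout, and the placeholder method and task labels are hypothetical, and the numbers in the usage example are made up for arithmetic only. The caption does not say exactly how the standard error is taken, so this sketch assumes that normalized scores are first averaged over methods and tasks for each seed, and that the mean and standard error are then computed across seeds.

```python
import math
from statistics import mean, stdev


def normalize(score, asymptotic_score):
    """Express a fine-tuned score as a fraction of the asymptotic reference score."""
    return score / asymptotic_score


def aggregate_category(scores, asymptotic):
    """Aggregate one algorithm category into (mean, standard error).

    scores:     {(method, task): [score for seed 0, seed 1, ...]}  (hypothetical layout)
    asymptotic: {task: reference score after long supervised training}

    Assumption: average normalized scores over (method, task) pairs per seed,
    then take the mean and standard error across seeds.
    """
    n_seeds = len(next(iter(scores.values())))
    per_seed = []
    for seed in range(n_seeds):
        normalized = [
            normalize(seed_scores[seed], asymptotic[task])
            for (_method, task), seed_scores in scores.items()
        ]
        per_seed.append(mean(normalized))
    # Standard error = sample standard deviation / sqrt(number of seeds).
    stderr = stdev(per_seed) / math.sqrt(n_seeds) if n_seeds > 1 else 0.0
    return mean(per_seed), stderr


if __name__ == "__main__":
    # Made-up numbers for illustration only; these are not results from the paper.
    asymptotic = {"task_a": 1000.0, "task_b": 500.0}
    scores = {
        ("method_1", "task_a"): [500.0, 700.0],
        ("method_1", "task_b"): [250.0, 350.0],
    }
    category_mean, category_stderr = aggregate_category(scores, asymptotic)
    print(f"normalized mean = {category_mean:.3f}, stderr = {category_stderr:.3f}")
```

Dividing by a per-task reference puts tasks with different reward scales on a common footing before averaging, which is what makes a single per-category number meaningful.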