RLeXplore: Accelerating Research in Intrinsically-Motivated Reinforcement Learning

Mingqi Yuan, Roger Creus Castanyer, Bo Li, Xin Jin, Wenjun Zeng, Glen Berseth

TL;DR

RLeXplore provides a standardized, modular framework for eight intrinsic reward methods to accelerate research in intrinsically-motivated RL. By decoupling intrinsic reward modules from RL optimization and detailing implementation nuances, it enables fair comparisons, reproducibility, and rapid integration with existing libraries. The study demonstrates that careful design choices—such as normalization, update dynamics, and memory usage—substantially affect performance, and that combining intrinsic rewards can yield emergent, high-quality exploration in sparse or reward-free settings. The framework’s open-source resources and benchmarks support broader adoption and progression toward robust, autonomous RL agents operating with minimal extrinsic supervision.

Abstract

Extrinsic rewards can effectively guide reinforcement learning (RL) agents in specific tasks. However, extrinsic rewards frequently fall short in complex environments due to the significant human effort needed for their design and annotation. This limitation underscores the necessity for intrinsic rewards, which offer auxiliary and dense signals and can enable agents to learn in an unsupervised manner. Although various intrinsic reward formulations have been proposed, their implementation and optimization details are insufficiently explored and lack standardization, thereby hindering research progress. To address this gap, we introduce RLeXplore, a unified, highly modularized, and plug-and-play framework offering reliable implementations of eight state-of-the-art intrinsic reward methods. Furthermore, we conduct an in-depth study that identifies critical implementation details and establishes well-justified standard practices in intrinsically-motivated RL. Our documentation, examples, and source code are available at https://github.com/RLE-Foundation/RLeXplore.

Paper Structure

This paper contains 44 sections, 12 equations, 31 figures, and 16 tables.

Figures (31)

  • Figure 1: The workflow of RLeXplore. (a) RLeXplore provides a decoupled module for intrinsic rewards that integrates seamlessly with the RL training loop. It implements 8 SOTA intrinsic rewards and leaves the RL training loop unmodified. (b) RLeXplore monitors the agent-environment interactions and gathers data samples using the .watch() function. After collecting experience rollouts, RLeXplore computes the corresponding intrinsic rewards using the .compute() function and updates the auxiliary models via the .update() function. (c) RLeXplore provides a Fabric class that allows developers to combine multiple intrinsic rewards in an elegant manner. The paper's appendix provides more details on how to add new intrinsic rewards to RLeXplore.
  • Figure 2: Screenshots of the selected exploration games. (a) SuperMarioBros. (b) MiniGrid. (c) ALE-5. (d) Procgen-Maze. (e) Ant-UMaze.
  • Figure 3: Episode returns achieved by the intrinsic rewards in RLeXplore. (left) SuperMarioBros without access to the task rewards. (right) MiniGrid-DoorKey-16×16 with sparse rewards.
  • Figure 4: Results for Q1, Q2, Q3, Q4, and Q5 in SMB (SuperMarioBros, top) and MGD (MiniGrid-DoorKey, bottom), normalized by the maximum achievable score in each task. Here, Combined refers to the results of using the best hyperparameters gathered in each question. Since RE3 only employs a fixed, randomly initialized neural network for encoding observations, there are no values in Q3. All results are aggregated over 10 seeds, and each run uses 10M environment interactions.
  • Figure 5: Aggregated performance of the eight intrinsic rewards with different low-level hyperparameters over 10 random seeds. The vertical dashed line represents the performance of the extrinsic agent, which only has access to the task rewards. Here, U. P. is the update proportion, O. N. is the observation normalization, R. N. is the reward normalization, IQM is the interquartile mean, OG is the optimality gap (lower is better), and Combined refers to the results of using the best hyperparameters gathered in each question. All the computation is performed using the Rliable library (Agarwal et al., 2021).
  • ...and 26 more figures
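
The watch / compute / update cycle described in the Figure 1 caption can be illustrated with a small, self-contained sketch. This is not the RLeXplore API: the class ToyCountReward, its constructor argument beta, the count-based reward formula, and the hard-coded rollouts are all hypothetical and chosen only to show how a reward module can sit beside an unmodified training loop. Only the three method names and their order of use come from the caption.

```python
import numpy as np


class ToyCountReward:
    """Hypothetical stand-in for an intrinsic reward module (NOT the RLeXplore API)."""

    def __init__(self, beta=1.0):
        self.beta = beta      # reward scale (assumed parameter name)
        self.counts = {}      # visit counts keyed by discretized observation
        self.seen = 0         # transitions observed through watch()

    def _key(self, obs):
        # Round so that nearby continuous observations share one bucket.
        return tuple(np.round(np.asarray(obs, dtype=float), 1).ravel())

    def watch(self, obs, action, reward, next_obs):
        # Monitor a single agent-environment interaction; this toy only counts it.
        self.seen += 1

    def compute(self, next_obs_batch):
        # Reward beta / sqrt(N(s')), treating unseen states as N = 1.
        n = np.array([self.counts.get(self._key(o), 0) for o in next_obs_batch],
                     dtype=float)
        return self.beta / np.sqrt(np.maximum(n, 1.0))

    def update(self, next_obs_batch):
        # The "auxiliary model" here is just the visit-count table.
        for o in next_obs_batch:
            k = self._key(o)
            self.counts[k] = self.counts.get(k, 0) + 1


# Two hand-written rollouts of next-observations stand in for a real environment.
rollouts = [
    [[0.0], [0.1], [0.0]],
    [[0.0], [0.5]],
]

module = ToyCountReward(beta=1.0)
rewards = []
for next_obs_batch in rollouts:
    for next_obs in next_obs_batch:
        # Placeholder obs/action/extrinsic reward: the toy module ignores them.
        module.watch(obs=None, action=0, reward=0.0, next_obs=next_obs)
    rewards.append(module.compute(next_obs_batch))  # rewards for this rollout
    module.update(next_obs_batch)                   # then refresh the counts

r1, r2 = rewards
```

In this sketch the first rollout receives a uniform reward because nothing has been counted yet, while in the second rollout the already-visited state earns less than the new one. The RL algorithm would simply add these values to (or substitute them for) the extrinsic rewards before its own update, which is what keeps the reward module decoupled from the optimizer.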