Reset-free Reinforcement Learning with World Models

Zhao Yang; Thomas M. Moerland; Mike Preuss; Aske Plaat; Edward S. Hu

TL;DR

MoReFree tackles reset-free RL by extending model-based reinforcement learning with Back-and-Forth Go-Explore and imagination-driven goal learning to focus on task-relevant states. Using PEG as a backbone, it leverages a world model to plan and train in imagination while guiding exploration toward initial and evaluation states, reducing over-exploration of irrelevant regions. Across eight reset-free tasks, MoReFree and a reset-free PEG variant achieve superior data efficiency and final performance without environmental rewards or demonstrations, especially on hard tasks, highlighting the promise of world-model-based reset-free RL. The work also provides thorough analyses and ablations, indicating the critical synergistic roles of its exploration and imagination components and outlining directions for adaptive curricula and scalability.
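As a rough illustration of the exploration idea summarized above, the sketch below alternates goal-conditioned rollouts between evaluation goals ("forth") and initial states ("back", emulating a reset), and otherwise falls back to an exploratory goal. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `make_goal_sampler`, `propose_exploratory_goal`, and `task_relevant_prob`, the strict alternation, and the default mixing ratio are all hypothetical and do not come from the paper.

```python
import random
from typing import Callable, Optional, Sequence, Tuple

Goal = Tuple[float, ...]


def make_goal_sampler(
    eval_goals: Sequence[Goal],
    initial_states: Sequence[Goal],
    propose_exploratory_goal: Callable[[], Goal],
    task_relevant_prob: float = 0.5,  # hypothetical mixing ratio, not from the source
    rng: Optional[random.Random] = None,
) -> Callable[[], Tuple[str, Goal]]:
    """Return a sampler that yields (goal_kind, goal) for the next rollout."""
    rng = rng or random.Random()
    go_forth = True  # True: head to an evaluation goal; False: head back to an initial state

    def sample() -> Tuple[str, Goal]:
        nonlocal go_forth
        if rng.random() < task_relevant_prob:
            # Task-relevant branch: alternate evaluation goals and initial states.
            if go_forth:
                kind, goal = "evaluation", rng.choice(list(eval_goals))
            else:
                kind, goal = "initial", rng.choice(list(initial_states))
            go_forth = not go_forth
            return kind, goal
        # Otherwise defer to an exploratory goal proposer (assumed to be supplied
        # by the backbone exploration method).
        return "exploratory", propose_exploratory_goal()

    return sample
```

Setting `task_relevant_prob` to 0 recovers purely exploratory goal selection, while a value of 1 makes the agent only practice moving between initial and evaluation states; the trade-off between the two is the balance the summary describes.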

Abstract

Reinforcement learning (RL) is an appealing paradigm for training intelligent agents, enabling policy acquisition from the agent's own autonomously acquired experience. However, the training process of RL is far from automatic, requiring extensive human effort to reset the agent and environments. To tackle the challenging reset-free setting, we first demonstrate the superiority of model-based RL (MBRL) methods in this setting, showing that a straightforward adaptation of MBRL can outperform all the prior state-of-the-art methods while requiring less supervision. We then identify limitations inherent to this direct extension and propose a solution called the model-based reset-free (MoReFree) agent, which further enhances performance. MoReFree adapts two key mechanisms, exploration and policy learning, to handle reset-free tasks by prioritizing task-relevant states. It exhibits superior data efficiency across various reset-free tasks without access to environmental reward or demonstrations while significantly outperforming privileged baselines that require supervision. Our findings suggest model-based methods hold significant promise for reducing human effort in RL. Website: https://yangzhao-666.github.io/morefree

Paper Structure (30 sections, 6 equations, 22 figures, 2 tables, 3 algorithms)

Introduction
Related Work
Preliminaries
Reset-free RL
Model-based RL setup
Method
Back-and-Forth Go-Explore
Learning to Achieve Relevant Goals in Imagination
Implementation Details
Experiments
Results
Analysis
Ablations
Conclusion and Future Work
Broader Impacts
...and 15 more sections

Figures (22)

Figure 1: Performance and collected data of different agents on the reset-free Ant locomotion task.
Figure 2: MoReFree is a model-based RL agent for solving reset-free tasks. Top row: MoReFree strikes a balance between exploring unseen states and practicing optimal behavior in task-relevant regions by directing the goal-conditioned policy to achieve evaluation states, initial states (emulating a reset), and exploratory goals. Bottom row: MoReFree focuses the goal-conditioned policy training inside the world model on achieving evaluation states, initial states, and random replay buffer states to better prepare the policy for the aforementioned exploration scheme.
Figure 3: We evaluate MoReFree on eight reset-free tasks ranging from navigation to manipulation. PP is short for Pick&Place.
Figure 4: Two reset-free MBRL methods (MoReFree and reset-free PEG) significantly outperform baselines in 7/8 tasks. However, directly applying MBRL methods (PEG and DreamerV2) works poorly. In 4 tasks, only MBRL methods are able to learn meaningful behavior, showcasing MBRL's sample efficiency in the reset-free setting. MoReFree outperforms reset-free PEG in the 3 most difficult tasks.
Figure 5: State visitation heatmaps of different agents. White areas are task-relevant states (including initial and goal state distributions), and we overlay the percentages of task-relevant states. Reset-free MBRL methods explore more, and in harder environments MoReFree experiences more task-relevant states.
...and 17 more figures
