Intelligent Switching for Reset-Free RL

Darshan Patil, Janarthanan Rajendran, Glen Berseth, Sarath Chandar

TL;DR

This work tackles the lack of environment resets in real-world RL by proposing RISC, an intelligent switching framework between forward and reset controllers guided by a competency-based success critic. It emphasizes timeout-nonterminal bootstrapping to keep learning targets stable and introduces modulated switching to balance exploration and exploitation. Empirical results on the EARL benchmark and a four-room gridworld show RISC achieving state-of-the-art performance among reset-free methods and robust improvements over ablated variants. The approach offers practical advantages for real-world deployment by improving sample efficiency and reducing redundant exploration in well-learned regions.
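The timeout-nonterminal bootstrapping mentioned above can be illustrated with a minimal sketch of one-step TD targets. It assumes a standard value-based setup; the function name `td_targets` and its parameters are hypothetical and not taken from the paper. The point is only that a transition cut off by a time limit is not marked as terminal, so its target still bootstraps from the next state's value.

```python
import numpy as np

def td_targets(rewards, next_values, terminated, gamma=0.99):
    """One-step TD targets that bootstrap through timeouts.

    `terminated` marks true environment terminations only. A transition that
    ends because of a time limit keeps terminated=False, so its target still
    includes the discounted next-state value.
    (Hypothetical helper for illustration, not the paper's code.)
    """
    rewards = np.asarray(rewards, dtype=float)
    next_values = np.asarray(next_values, dtype=float)
    not_done = 1.0 - np.asarray(terminated, dtype=float)
    return rewards + gamma * not_done * next_values

# Three transitions: an ordinary step, a timeout (kept nonterminal),
# and a true termination (no bootstrapping).
targets = td_targets(
    rewards=[0.0, 0.0, 1.0],
    next_values=[2.0, 2.0, 2.0],
    terminated=[False, False, True],
    gamma=0.5,
)
```

If the timeout were instead flagged as terminal, the second target would drop the bootstrapped term, and the same state could receive different targets depending on where the time limit happened to fall.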

Abstract

In the real world, the strong episode resetting mechanisms that are needed to train agents in simulation are unavailable. The resetting assumption limits the potential of reinforcement learning in the real world, as providing resets to an agent usually requires the creation of additional handcrafted mechanisms or human interventions. Recent work aims to train agents (forward) with learned resets by constructing a second (backward) agent that returns the forward agent to the initial state. We find that the termination and timing of the transitions between these two agents are crucial for algorithm success. With this in mind, we create a new algorithm, Reset Free RL with Intelligently Switching Controller (RISC), which intelligently switches between the two agents based on the agent's confidence in achieving its current goal. Our new method achieves state-of-the-art performance on several challenging environments for reset-free RL.
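The idea of switching between the forward and backward agents based on confidence can be sketched as follows. This excerpt does not give the exact switching rule, so everything here is an assumption made for illustration: the names `should_switch` and `next_direction`, the `min_steps` guard, and the choice to switch with probability equal to the estimated success probability.

```python
import random

def should_switch(success_prob, steps_in_trajectory, min_steps=10, rng=random):
    """Decide whether to hand control to the other agent.

    success_prob: estimated probability (for example from a learned success
        critic) that the active agent reaches its current goal.
    Hypothetical rule: never switch during the first `min_steps` steps;
    afterwards, switch with probability equal to `success_prob`.
    """
    if steps_in_trajectory < min_steps:
        return False
    # rng.random() is in [0, 1), so success_prob=1.0 always switches
    # and success_prob=0.0 never does.
    return rng.random() < success_prob

def next_direction(direction):
    """Flip between the forward (task goal) and backward (initial state) agents."""
    return "backward" if direction == "forward" else "forward"

direction = "forward"
if should_switch(success_prob=1.0, steps_in_trajectory=25):
    direction = next_direction(direction)
```

Under this rule the agent spends less time in regions where it is already confident of success, which is the behavior the abstract attributes to RISC; the actual criterion used by the authors may differ.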

Paper Structure (22 sections, 7 equations, 11 figures, 5 tables, 2 algorithms)

This paper contains 22 sections, 7 equations, 11 figures, 5 tables, 2 algorithms.

Figures (11)

  • Figure 1: In RL, the agent usually starts learning about the task in areas around rewarding states, and eventually propagates the learning to other parts of the state space. (Left) In episodic learning, the agent starts its trajectories at a state in the initial state distribution. Through exploration, it might find a trajectory that produces a reward, but might struggle to reach that goal state again, particularly on sparse-reward or long-horizon tasks. (Center) A common approach in Reset-Free RL is to build a curriculum outward from the task's goal states. While this allows the agent to frequently visit rewarding states, it also means the majority of the agent's experience will be generated in areas it has already learned. (Right) RISC switches directions when it feels confident in its ability to achieve its current goal, in either the forward direction (the task's goal states) or the backward direction (the task's initial states). This not only reduces the time spent in already explored regions of the state space, but also reduces the average distance to the goal, which makes it easier for the agent to find high-value states.
  • Figure 2: The environments used in the experiments section. The first 4 are part of the EARL benchmark (Sharma et al., 2021), and the last is based on Minigrid (Chevalier-Boisvert et al., 2018).
  • Figure 3: The heatmaps show the progression of the Q-values (top) and the visitation frequencies (middle) of the agent in forward mode in the $2000$ steps after the noted timestep. The reverse curriculum method tends to be biased towards states that it has already learned, while RISC is more evenly distributed, and slightly biased towards states it hasn't fully learned.
  • Figure 4: 95% confidence intervals for the interquartile mean (IQM) and mean normalized performance of RISC and other baselines aggregated across environments on the EARL benchmark (Tabletop Manipulation, Sawyer Door, Sawyer Peg, and Minitaur). Because MEDAL and VapRL require demonstrations and thus do not work on Minitaur, we exclude Minitaur from their calculations (left). IQM of RISC and other baselines on EARL benchmark as a function of progress through the training run. Shaded regions represent 95% confidence intervals (right). RISC outperforms and learns much faster than other reset-free baselines.
  • Figure 5: Average test returns on EARL benchmark tasks over training timesteps. RISC improves upon or matches the state-of-the-art for reset-free algorithms on 3 of the 4 environments (Tabletop Manipulation, Sawyer Door, Sawyer Peg), and even outperforms or matches the Episodic baseline on Sawyer Door and Sawyer Peg. Learning curves and code were not available for several baselines for Minitaur, so only the final performance is plotted. Results are averaged over 5 seeds and the shaded regions represent standard error.
  • ...and 6 more figures
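
The captions for Figures 4 and 5 report the interquartile mean (IQM) and the standard error across seeds. As a rough illustration of those two statistics, the sketch below computes them with NumPy. The function names and the example scores are hypothetical and are not data from the paper, and the 95% confidence intervals shown in the figures are not computed here.

```python
import numpy as np

def interquartile_mean(scores):
    """Mean of the middle 50% of scores (drops the lowest and highest 25%)."""
    x = np.sort(np.asarray(scores, dtype=float).ravel())
    cut = int(0.25 * x.size)  # number of values trimmed from each end
    return x[cut:x.size - cut].mean()

def standard_error(scores):
    """Sample standard deviation divided by sqrt(n), e.g. across seeds."""
    x = np.asarray(scores, dtype=float)
    return x.std(ddof=1) / np.sqrt(x.size)

# Made-up normalized scores with one outlier, for illustration only.
scores = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 0.9, 10.0]
iqm = interquartile_mean(scores)      # averages 0.4, 0.5, 0.6, 0.8
plain_mean = float(np.mean(scores))   # pulled upward by the 10.0 outlier
```

Because the IQM discards the top and bottom quarters of the runs, a single outlier run moves it far less than it moves the plain mean, which is why it is a common choice for aggregating results across environments and seeds.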