Learning to explore when mistakes are not allowed

Charly Pecqueux-Guézénec, Stéphane Doncieux, Nicolas Perrin-Gilbert

TL;DR

The paper tackles safe exploration in goal-conditioned reinforcement learning by marrying a pre-trained distributional safety policy with a goal-conditioned policy, using an arbitration mechanism driven by risk estimates. It introduces two phases: pretraining a safety policy with ensemble distributional critics and implementing a risk-aware action selector to switch to safety when needed during GC learning. Key contributions include a distributional safe RL framework with reachability critics, a three-strategy risk measure (Time, Constraint, Time-Constraint) for action selection, and extensive ablations and failure-mode analyses. The approach yields substantially fewer mistakes during exploration while maintaining competitive goal-space coverage in CartPoleGC and SkydioX2GC, indicating strong practical potential for safe real-world learning under safety constraints.

Abstract

Goal-Conditioned Reinforcement Learning (GCRL) provides a versatile framework for developing unified controllers capable of handling wide ranges of tasks, exploring environments, and adapting behaviors. However, its reliance on trial-and-error poses challenges for real-world applications, as errors can result in costly and potentially damaging consequences. To address the need for safer learning, we propose a method that enables agents to learn goal-conditioned behaviors that explore without the risk of making harmful mistakes. Exploration without risks can seem paradoxical, but environment dynamics are often uniform in space; therefore, a policy trained for safety without exploration purposes can still be exploited globally. Our proposed approach involves two distinct phases. First, during a pretraining phase, we employ safe reinforcement learning and distributional techniques to train a safety policy that actively tries to avoid failures in various situations. In the subsequent safe exploration phase, a goal-conditioned (GC) policy is learned while ensuring safety. To achieve this, we implement an action-selection mechanism leveraging the previously learned distributional safety critics to arbitrate between the safety policy and the GC policy, ensuring safe exploration by switching to the safety policy when needed. We evaluate our method in simulated environments and demonstrate that it not only provides substantial coverage of the goal space but also reduces the occurrence of mistakes to a minimum, in stark contrast to traditional GCRL approaches. Additionally, we conduct an ablation study and analyze failure modes, offering insights for future research directions.

Paper Structure

This paper contains 39 sections, 5 equations, 12 figures, 3 tables, 3 algorithms.

Figures (12)

  • Figure 1: Action selection mechanism to guarantee safe exploration. The agent observes the current state $s$ of the environment and the current goal $g$. The safety policy samples an action $a_S \sim \pi_{\phi_S}(.|s)$ that must prevent future mistakes, while the goal-conditioned policy samples an action $a_{GC} \sim \pi_{\phi_{GC}}(.|s,g)$ to go towards $g$. For each possibility, the function $\sigma^{\pi_{\phi_S}}$ estimates the level of confidence in the safety policy's ability to avoid potential future errors. If it is too low, the safe action $a_S$ is executed to keep the system safe. Otherwise, $a_{GC}$ is executed, allowing the agent to explore.
  • Figure 2: Environments
  • Figure 3: Comparison between our method (cf. $L\&S$ in Section \ref{subsec:ablation_dist}) and the baseline on the CartPoleGC environment in terms of safety during exploration and coverage.
  • Figure 4: Comparison between our method (cf. $S$ in Section \ref{subsec:ablation_dist}) and the baseline on the SkydioX2GC environment in terms of safety during exploration and coverage.
  • Figure 5: Effect of the reachability critics on safe exploration with CartPoleGC for different variants: $L\&S$ (reachability in safety learning and action selection), $L$ (learning only), $S$ (action selection only), and $None$ (no reachability).
  • ...and 7 more figures
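
The arbitration described in the Figure 1 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the name `select_action`, the scalar `threshold`, and the choice to evaluate the confidence function only on the goal-conditioned candidate are assumptions made for illustration. The callables stand in for $\pi_{\phi_S}$, $\pi_{\phi_{GC}}$ and $\sigma^{\pi_{\phi_S}}$, whose actual form (ensemble distributional critics and the Time, Constraint, and Time-Constraint risk measures) is defined in the paper, not here.

```python
from typing import Callable, Tuple

# Hypothetical type aliases for a 1-D toy problem; the paper's
# environments use richer state, goal and action spaces.
State = float
Goal = float
Action = float


def select_action(
    state: State,
    goal: Goal,
    safety_policy: Callable[[State], Action],
    gc_policy: Callable[[State, Goal], Action],
    confidence: Callable[[State, Action], float],
    threshold: float,
) -> Tuple[Action, bool]:
    """Arbitrate between the safety policy and the goal-conditioned policy.

    Returns the action to execute and a flag that is True when the
    safety policy took over.
    """
    a_safe = safety_policy(state)      # a_S ~ pi_S(. | s)
    a_gc = gc_policy(state, goal)      # a_GC ~ pi_GC(. | s, g)

    # Assumption: `confidence` estimates how well the safety policy could
    # still avoid future mistakes if a_gc were executed now.
    if confidence(state, a_gc) < threshold:
        return a_safe, True            # confidence too low: stay safe
    return a_gc, False                 # confident enough: explore


# Toy stand-ins (purely illustrative): failure happens when |x| > 1.
def toy_safety_policy(state: State) -> Action:
    # Push back toward the centre of the safe region.
    return -1.0 if state > 0 else 1.0


def toy_gc_policy(state: State, goal: Goal) -> Action:
    # Move toward the goal.
    return 1.0 if goal > state else -1.0


def toy_confidence(state: State, action: Action) -> float:
    # Confidence shrinks as the predicted next state nears the boundary.
    next_state = state + 0.1 * action
    return max(0.0, 1.0 - abs(next_state))
```

With these toy stand-ins, a state far from the boundary lets the goal-conditioned action through, while a state close to it triggers the switch to the safety action. In practice the threshold trades off exploration against the frequency of safety interventions.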