Learning to explore when mistakes are not allowed
Charly Pecqueux-Guézénec, Stéphane Doncieux, Nicolas Perrin-Gilbert
TL;DR
The paper tackles safe exploration in goal-conditioned reinforcement learning by combining a pre-trained distributional safety policy with a goal-conditioned (GC) policy, using an arbitration mechanism driven by risk estimates. The method has two phases: first, a safety policy is pretrained with an ensemble of distributional critics; then, during GC learning, a risk-aware action selector switches to the safety policy when needed. Key contributions include a distributional safe RL framework with reachability critics, three risk-measure strategies for action selection (Time, Constraint, and Time-Constraint), and extensive ablations and failure-mode analyses. The approach makes substantially fewer mistakes during exploration while maintaining competitive goal-space coverage in CartPoleGC and SkydioX2GC, suggesting practical potential for real-world learning under safety constraints.
Abstract
Goal-Conditioned Reinforcement Learning (GCRL) provides a versatile framework for developing unified controllers capable of handling wide ranges of tasks, exploring environments, and adapting behaviors. However, its reliance on trial and error poses challenges for real-world applications, as errors can have costly and potentially damaging consequences. To address the need for safer learning, we propose a method that enables agents to learn goal-conditioned behaviors that explore without the risk of making harmful mistakes. Exploration without risk can seem paradoxical, but environment dynamics are often uniform in space, so a policy trained for safety, without any exploration objective, can still be exploited globally. Our proposed approach involves two distinct phases. First, during a pretraining phase, we employ safe reinforcement learning and distributional techniques to train a safety policy that actively tries to avoid failures in various situations. In the subsequent safe exploration phase, a goal-conditioned (GC) policy is learned while ensuring safety. To achieve this, we implement an action-selection mechanism that leverages the previously learned distributional safety critics to arbitrate between the safety policy and the GC policy, ensuring safe exploration by switching to the safety policy when needed. We evaluate our method in simulated environments and demonstrate that it not only provides substantial coverage of the goal space but also reduces the occurrence of mistakes to a minimum, in stark contrast to traditional GCRL approaches. Additionally, we conduct an ablation study and analyze failure modes, offering insights for future research directions.
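The arbitration idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the names `tail_risk`, `ensemble_risk`, and `select_action`, the CVaR-style tail mean, the max-over-ensemble aggregation, and the fixed `threshold` are all assumptions made for illustration, and the paper's own risk strategies (Time, Constraint, Time-Constraint) are not reproduced here. Each critic is assumed to return samples of a predicted cost for a state-action pair, where higher means more dangerous.

```python
from typing import Callable, List, Sequence, Tuple

State = Sequence[float]
Action = Sequence[float]
# Hypothetical critic interface: (state, action) -> samples of predicted cost.
Critic = Callable[[State, Action], List[float]]


def tail_risk(samples: List[float], alpha: float = 0.25) -> float:
    """Mean of the worst `alpha` fraction of samples (a CVaR-style estimate)."""
    if not samples:
        raise ValueError("need at least one sample")
    k = max(1, int(round(alpha * len(samples))))
    worst = sorted(samples, reverse=True)[:k]
    return sum(worst) / k


def ensemble_risk(critics: Sequence[Critic], state: State, action: Action,
                  alpha: float = 0.25) -> float:
    """Pessimistic aggregate: largest tail risk across ensemble members."""
    return max(tail_risk(c(state, action), alpha) for c in critics)


def select_action(state: State, goal: Sequence[float],
                  gc_policy: Callable[[State, Sequence[float]], Action],
                  safety_policy: Callable[[State], Action],
                  critics: Sequence[Critic],
                  threshold: float,
                  alpha: float = 0.25) -> Tuple[Action, bool]:
    """Return (action, switched). Falls back to the safety policy when the
    GC policy's proposed action is judged too risky (assumed rule)."""
    proposed = gc_policy(state, goal)
    if ensemble_risk(critics, state, proposed, alpha) > threshold:
        return safety_policy(state), True
    return proposed, False


# Toy 1-D setup (entirely made up): cost grows with distance from the origin.
def toy_critic(offset: float) -> Critic:
    def critic(state: State, action: Action) -> List[float]:
        nxt = abs(state[0] + action[0])
        return [nxt + offset + d for d in (0.0, 0.1, 0.2, 0.3)]
    return critic


critics = [toy_critic(0.0), toy_critic(0.05)]
gc_policy = lambda s, g: [1.0 if g[0] > s[0] else -1.0]
safety_policy = lambda s: [-0.5 if s[0] > 0 else 0.5]

near_origin = select_action([0.0], [2.0], gc_policy, safety_policy, critics, 1.5)
near_edge = select_action([0.9], [2.0], gc_policy, safety_policy, critics, 1.5)
```

In this toy setup the GC action is kept near the origin and overridden by the safety policy closer to the assumed danger region. Taking the maximum over ensemble members is one conservative choice among several; averaging or a disagreement-based penalty would be equally plausible under the same interface.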
