A Safe Exploration Strategy for Model-free Task Adaptation in Safety-constrained Grid Environments
Erfan Entezami, Mahsa Sahebdel, Dhawal Gupta
TL;DR
This work tackles safe exploration for model-free reinforcement learning in safety-constrained grid environments by introducing a two-phase framework. A pre-training phase builds a BRS-based detector (via a binary classifier) that predicts potentially unsafe states, which is then applied to new environments to trigger a predefined safe policy during exploration. The approach reduces hazardous exploration, demonstrated on MiniGrid-style tasks with moving obstacles, showing fewer collisions and improved learning efficiency compared to standard epsilon-greedy strategies. The key contribution is a practical, pre-training-guided mechanism for safe exploration that does not require prior unsafe-state distributions or learning the safe policy online, enabling safer adaptation to new tasks.
Abstract
Training a model-free reinforcement learning agent requires letting the agent explore the environment sufficiently to search for an optimal policy. In safety-constrained environments, unsupervised exploration or a non-optimal policy may lead the agent to undesirable states, with outcomes that are potentially costly or hazardous for both the agent and the environment. In this paper, we introduce a new exploration framework for navigating grid environments that enables model-free agents to interact with the environment while adhering to safety constraints. Our framework includes a pre-training phase, during which the agent learns to identify potentially unsafe states based on both observable features and the safety constraints specified in the environment. A binary classification model is then trained to predict such unsafe states in new environments that exhibit similar dynamics. The trained classifier lets model-free agents identify situations in which random exploration or a suboptimal policy may pose safety risks; in those situations, our framework prompts the agent to follow a predefined safe policy to mitigate the potential for hazardous consequences. We evaluated our framework on three randomly generated grid environments and demonstrated how model-free agents can safely adapt to new tasks and learn optimal policies for new environments. Our results indicate that, given an appropriate safe policy and a well-trained model for detecting unsafe states, our framework enables a model-free agent to adapt to new tasks and environments with significantly fewer safety violations.
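To make the exploration rule described above concrete, the following is a minimal sketch of classifier-gated epsilon-greedy action selection. It is an illustration under stated assumptions, not the authors' implementation: the names `select_action`, `is_unsafe`, and `safe_policy` are hypothetical, the detector is treated as an opaque callable standing in for the trained binary classifier, and actions are assumed to be integer indices into a list of Q-values.

```python
import random
from typing import Callable, Hashable, Sequence

State = Hashable


def select_action(
    state: State,
    q_values: Sequence[float],
    is_unsafe: Callable[[State], bool],   # hypothetical stand-in for the trained classifier
    safe_policy: Callable[[State], int],  # hypothetical predefined safe policy
    epsilon: float,
    rng: random.Random,
) -> int:
    """Pick an action, deferring to the safe policy in flagged states."""
    # If the detector flags the state as potentially unsafe, neither random
    # exploration nor the current (possibly suboptimal) greedy action is
    # trusted, so the predefined safe policy decides.
    if is_unsafe(state):
        return safe_policy(state)

    # Otherwise fall back to ordinary epsilon-greedy exploration.
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))

    # Greedy action: index of the largest Q-value (first index on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

In this sketch the safe policy overrides both the exploratory and the greedy branch whenever the detector fires. A variant that overrides only the exploratory branch would also be consistent with the description, and which one matches the paper's method is not determined by the abstract alone.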
