A Safe Exploration Strategy for Model-free Task Adaptation in Safety-constrained Grid Environments

Erfan Entezami, Mahsa Sahebdel, Dhawal Gupta

TL;DR

This work tackles safe exploration for model-free reinforcement learning in safety-constrained grid environments with a two-phase framework. In a pre-training phase, a binary classifier is trained as a BRS detector that predicts potentially unsafe states. The detector is then applied in new environments, where it triggers a predefined safe policy during exploration. On MiniGrid-style tasks with moving obstacles, the approach yields fewer collisions and improved learning efficiency compared to standard epsilon-greedy exploration. The key contribution is a practical, pre-training-guided mechanism for safe exploration that requires neither a prior distribution over unsafe states nor learning the safe policy online, which enables safer adaptation to new tasks.

Abstract

Training a model-free reinforcement learning agent requires letting the agent explore the environment sufficiently to search for an optimal policy. In safety-constrained environments, unsupervised exploration or a non-optimal policy may lead the agent to undesirable states, with outcomes that are potentially costly or hazardous for both the agent and the environment. In this paper, we introduce a new exploration framework for navigating grid environments that enables model-free agents to interact with the environment while adhering to safety constraints. Our framework includes a pre-training phase, during which the agent learns to identify potentially unsafe states based on both observable features and specified safety constraints in the environment. Subsequently, a binary classification model is trained to predict those unsafe states in new environments that exhibit similar dynamics. This trained classifier lets model-free agents identify situations in which random exploration or a suboptimal policy may pose safety risks; in those situations, our framework prompts the agent to follow a predefined safe policy to mitigate potentially hazardous consequences. We evaluated our framework on three randomly generated grid environments and demonstrated how model-free agents can safely adapt to new tasks and learn optimal policies for new environments. Our results indicate that, given an appropriate safe policy and a well-trained model for detecting unsafe states, our framework enables a model-free agent to adapt to new tasks and environments with significantly fewer safety violations.
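
The action-selection rule described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: `select_action`, `is_unsafe` (standing in for the trained binary classifier), and `safe_policy` (standing in for the predefined safe policy) are hypothetical names, and the greedy branch assumes a tabular list of action values.

```python
import random
from typing import Callable, Hashable, Optional, Sequence


def select_action(
    state: Hashable,
    q_values: Sequence[float],
    is_unsafe: Callable[[Hashable], bool],   # hypothetical: wraps the trained classifier
    safe_policy: Callable[[Hashable], int],  # hypothetical: the predefined safe policy
    epsilon: float = 0.2,
    rng: Optional[random.Random] = None,
) -> int:
    """Choose an action, deferring to the safe policy in flagged states."""
    rng = rng or random.Random()

    # If the classifier flags the state, neither random exploration nor the
    # (possibly suboptimal) learned policy is trusted: follow the safe policy.
    if is_unsafe(state):
        return safe_policy(state)

    # Otherwise fall back to ordinary epsilon-greedy selection.
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Keeping the classifier and the safe policy as plain callables means the same wrapper could sit in front of any value-based learner; only the non-flagged branch depends on the agent's own action values.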

Paper Structure

This paper contains 11 sections, 3 equations, 4 figures, and 1 table.

Figures (4)

  • Figure 1: During the pre-training phase, a binary classification model is trained using features extracted from both BRS and non-BRS states. This model is subsequently used in a new environment to identify situations where using an $\epsilon$-greedy exploration strategy might pose a risk.
  • Figure 2: We designed a 10x10 grid environment (Pre-training zone) containing one moving obstacle and one goal state to train the BRS detection model, and we created three 15x15 randomly generated grid environments (Task 1 to 3), each containing one moving obstacle, five blocked states, and one goal state to evaluate the performance of our framework.
  • Figure 3: Training process of the Q-learning algorithm for the designed tasks. Diagrams on the left depict the mean of the average returns for the 10 most recent episodes, and those on the right illustrate the percentage of episodes that ended with a collision. All values are obtained by running each test 20 times and averaging the results. In all experiments, we used $\gamma = 0.99$, an exploration rate of 0.2, and a learning rate of 0.5 as our hyperparameters.
  • Figure 4: Training process of the SARSA algorithm for the designed tasks. As in Figure 3, diagrams on the left depict the mean of the average returns for the 10 most recent episodes, and those on the right illustrate the percentage of episodes that ended with a collision. All values are obtained by running each test 20 times and averaging the results. We used hyperparameters similar to those used for the Q-learning algorithm.
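
To make the quantities in the Figure 3 and Figure 4 captions concrete, the following sketch shows the standard tabular Q-learning and SARSA updates with the stated discount factor and learning rate, plus a small tracker for the two plotted metrics. It assumes a dictionary-backed Q-table and reads "average returns for the 10 most recent episodes" as a trailing mean over the last 10 episode returns within a run; the names `q_learning_update`, `sarsa_update`, and `EpisodeStats` are hypothetical and do not come from the paper.

```python
from collections import defaultdict, deque

GAMMA = 0.99  # discount factor stated in the Figure 3 caption
ALPHA = 0.5   # learning rate stated in the Figure 3 caption


def q_learning_update(q, s, a, r, s_next, n_actions, done):
    """Off-policy update: bootstrap from the best next action."""
    best_next = max(q[(s_next, b)] for b in range(n_actions))
    target = r if done else r + GAMMA * best_next
    q[(s, a)] += ALPHA * (target - q[(s, a)])


def sarsa_update(q, s, a, r, s_next, a_next, done):
    """On-policy update: bootstrap from the action actually taken next."""
    target = r if done else r + GAMMA * q[(s_next, a_next)]
    q[(s, a)] += ALPHA * (target - q[(s, a)])


class EpisodeStats:
    """Tracks a trailing mean return and the share of episodes ending in collision."""

    def __init__(self, window=10):
        self.recent = deque(maxlen=window)  # keeps only the last `window` returns
        self.episodes = 0
        self.collisions = 0

    def record(self, episode_return, collided):
        self.recent.append(episode_return)
        self.episodes += 1
        self.collisions += int(collided)

    def recent_mean_return(self):
        return sum(self.recent) / len(self.recent)

    def collision_percentage(self):
        return 100.0 * self.collisions / self.episodes


# Usage: one terminal transition with reward 1.0 moves Q(0, 1) halfway to the target.
q = defaultdict(float)
q_learning_update(q, s=0, a=1, r=1.0, s_next=1, n_actions=2, done=True)
```

Averaging over the 20 runs mentioned in the captions would then amount to keeping one `EpisodeStats` per run and taking the mean of their values at each episode index.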