Safety Representations for Safer Policy Learning

Kaustubh Mani, Vincent Mai, Charlie Gauthier, Annie Chen, Samer Nashed, Liam Paull

TL;DR

This work addresses a tension in safe reinforcement learning, where exploration is either unsafe or overly conservative, by proposing Safety Representations for Policy Learning (SRPL). SRPL learns a state-conditioned safety representation, modeled as a distribution over the number of steps to unsafe states via a steps-to-cost (S2C) network, and augments the agent's state with this information to guide safer exploration. The safety distribution is learned from the agent's diverse experiences and can transfer across tasks, acting as an effective prior for new policies. Empirical results across manipulation, navigation, and driving tasks show that SRPL improves both safety during learning and task performance, and zero-shot and fine-tuned transfer results indicate its practicality for safety-critical domains.

Abstract

Reinforcement learning algorithms typically necessitate extensive exploration of the state space to find optimal policies. However, in safety-critical applications, the risks associated with such exploration can lead to catastrophic consequences. Existing safe exploration methods attempt to mitigate this by imposing constraints, which often result in overly conservative behaviours and inefficient learning. Heavy penalties for early constraint violations can trap agents in local optima, deterring exploration of risky yet high-reward regions of the state space. To address this, we introduce a method that explicitly learns state-conditioned safety representations. By augmenting the state features with these safety representations, our approach naturally encourages safer exploration without being excessively cautious, resulting in more efficient and safer policy learning in safety-critical scenarios. Empirical evaluations across diverse environments show that our method significantly improves task performance while reducing constraint violations during training, underscoring its effectiveness in balancing exploration with safety.
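
As a rough illustration of the idea described above, the sketch below shows one way steps-to-cost targets could be derived from an episode's per-step costs and appended to the state. The function names (`steps_to_cost_labels`, `one_hot`, `augment_state`), the clipping horizon, and the one-hot encoding are assumptions made for this example, not details taken from the paper. In SRPL the distribution is produced by a learned S2C network rather than by a lookup like this.

```python
from typing import List, Sequence


def steps_to_cost_labels(costs: Sequence[float], horizon: int) -> List[int]:
    """For each timestep, count the steps until the next cost-inducing step.

    Labels are clipped to `horizon` (an assumed hyperparameter), which also
    acts as the "no cost observed within the horizon" bin.
    """
    labels = [horizon] * len(costs)
    steps_until_cost = horizon  # nothing unsafe is known past the episode end
    # Walk backwards so each step can reuse the count from its successor.
    for t in range(len(costs) - 1, -1, -1):
        if costs[t] > 0:
            steps_until_cost = 0
        else:
            steps_until_cost = min(steps_until_cost + 1, horizon)
        labels[t] = steps_until_cost
    return labels


def one_hot(label: int, num_bins: int) -> List[float]:
    """Encode a steps-to-cost label as a degenerate distribution over bins."""
    return [1.0 if i == label else 0.0 for i in range(num_bins)]


def augment_state(state: Sequence[float], safety_dist: Sequence[float]) -> List[float]:
    """Concatenate the raw state features with the safety distribution."""
    return list(state) + list(safety_dist)


# Hypothetical episode: a cost is incurred at the fourth step.
episode_costs = [0, 0, 0, 1, 0, 0]
labels = steps_to_cost_labels(episode_costs, horizon=4)
augmented = augment_state([0.5, -1.0], one_hot(labels[2], num_bins=5))
```

Here the one-hot vectors stand in for training targets. A trained model would instead output a soft distribution over the same bins, and that output would be what gets concatenated to the state.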

Paper Structure

This paper contains 31 sections, 3 equations, 20 figures, 2 tables.

Figures (20)

  • Figure 1: To motivate the benefit of learning state-conditioned safety representations in safety-critical applications, we perform experiments on the Island Navigation environment [ai-grid]. We assume access to the Manhattan distance to the nearest water cell as ground truth (GT) safety information. (Col 1) shows that without this information, penalties for failures early in the learning process bias the agent toward overly conservative behaviour, resulting in suboptimal policies that avoid water but fail to complete the task. (Col 2) compares the Q-value estimates of a DQN agent with and without GT safety information across all states over multiple episodes. Without safety information, the agent fails to distinguish between risky and less risky states, producing uniformly low Q-values across all states. This highlights the inability of RL agents to learn good safety representations using reward signals alone. (Col 3) examines state visitation patterns over episodes. Both agents initially explore the environment, but the agent without safety information quickly reduces exploration, oscillating between two states to avoid failure and never reaching the goal state, whereas the agent with safety information explores a larger region of the state space. See Sec. \ref{sec:motivating_example} for a more detailed discussion.
  • Figure 2: SRPL Framework: SRPL explicitly learns safety representations for states as a distribution over proximity to unsafe (cost-inducing) states through a steps-to-cost (S2C) model and uses this information to implicitly guide policy learning towards exploring safer regions of the state space.
  • Figure 3: Safety Representation: We demonstrate the S2C model's output on two different states ($A$ and $B$). $A$, being farther from the water cells (indicated in blue), has its peak at 3 (its distance from the unsafe set), while $B$ is a riskier state and has its peak at 1.
  • Figure 4: Performance of SRPL agents (denoted SR-*) on four different tasks. For these experiments, both the S2C model and the policy are randomly initialized, so no prior information is provided to the agent. SRPL agents consistently outperform their baseline counterparts on both safety during learning and sample efficiency. Results are averaged over training runs with five seeds. The input to the RL agent is state-based, in the form of joint states or LiDAR points.
  • Figure 5: Off-policy results for CSC and CVPO and their SRPL counterparts on Safety Gym environments over 2M timesteps. CSC achieves better constraint satisfaction but yields suboptimal performance, whereas CVPO has better sample efficiency, which SRPL improves further.
  • ...and 15 more figures
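
The ground-truth safety signal mentioned in the Figure 1 caption, the Manhattan distance to the nearest water cell, can be sketched as follows. The grid layout, the `"W"` marker for water, and the function name `manhattan_to_nearest_water` are hypothetical choices for this example and do not reproduce the actual Island Navigation map.

```python
from typing import List, Sequence


def manhattan_to_nearest_water(grid: Sequence[str]) -> List[List[int]]:
    """Return, for every cell, the Manhattan distance to the closest water cell.

    `grid` is a list of equal-length strings in which "W" marks water
    (an assumed encoding). Water cells themselves get distance 0.
    """
    water = [
        (r, c)
        for r, row in enumerate(grid)
        for c, cell in enumerate(row)
        if cell == "W"
    ]
    if not water:
        raise ValueError("grid contains no water cells")
    # Brute force over water cells is fine for small gridworlds.
    return [
        [min(abs(r - wr) + abs(c - wc) for wr, wc in water) for c in range(len(row))]
        for r, row in enumerate(grid)
    ]


# Hypothetical 3x4 layout with water in two opposite corners.
toy_grid = [
    "W...",
    "....",
    "...W",
]
distances = manhattan_to_nearest_water(toy_grid)
```

In this toy layout, a cell next to water has distance 1 and a cell in the middle of the grid has distance 2. This matches the intuition in the Figure 3 caption that riskier states have their steps-to-cost peak at smaller values.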