Safety Modulation: Enhancing Safety in Reinforcement Learning through Cost-Modulated Rewards

Hanping Zhang, Yuhong Guo

TL;DR

The paper addresses safe reinforcement learning under Constrained Markov Decision Processes (CMDPs) by introducing Safety Modulated Policy Optimization (SMPO). SMPO learns a safety critic $Q^c_\phi(s,a)$ to estimate future cumulative costs and modulates the standard reward $\mathcal{R}(s,a)$ with a cost-aware weighting function $f(Q^c_\phi(s,a))$, yielding a differentiable objective $\mathcal{M}(\mathcal{R})=f(Q^c_\phi(s,a))\mathcal{R}(s,a)$ that converts the CMDP into an unconstrained RL problem. The method derives a policy gradient that includes gradients through the safety critic and proposes a dynamic threshold schedule to balance exploration and safety during training. Experiments on Safety Gym across multiple tasks show SMPO maintains safety constraints while achieving superior or competitive rewards compared to baseline safe RL methods, validating the effectiveness of cost-aware reward modulation and safety critics for real-world safe RL deployment.
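As a minimal sketch of the modulation $\mathcal{M}(\mathcal{R})=f(Q^c_\phi(s,a))\mathcal{R}(s,a)$, the Python snippet below scales a reward by a cost-aware weight. The exact form of $f$ is not stated in this summary, so `cost_aware_weight` is a hypothetical stand-in, chosen only so that it equals 1 at zero estimated cost and 0 at the threshold `d`. The defaults `b=3` and `d=25` are borrowed from the Figure 2 caption, and all function names are illustrative.

```python
def cost_aware_weight(q_cost: float, d: float = 25.0, b: float = 3.0) -> float:
    """Hypothetical weighting f (an assumption, not the paper's formula).

    Returns 1 when the estimated cumulative cost is 0, 0 when it equals the
    threshold d, and a negative value once the estimate exceeds d.
    """
    return (b ** ((d - q_cost) / d) - 1.0) / (b - 1.0)


def modulated_reward(reward: float, q_cost: float,
                     d: float = 25.0, b: float = 3.0) -> float:
    """M(R) = f(Q^c(s, a)) * R(s, a), with q_cost standing in for Q^c(s, a)."""
    return cost_aware_weight(q_cost, d, b) * reward


if __name__ == "__main__":
    # Print the modulated value of a unit reward at a few cost estimates.
    for q in (0.0, 12.5, 25.0, 50.0):
        print(q, modulated_reward(1.0, q))
```

With this stand-in, a positive reward is left intact when the critic predicts no cost, is suppressed as the estimate approaches `d`, and changes sign beyond it, which is one way an unconstrained objective could discourage unsafe actions.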

Abstract

Safe Reinforcement Learning (Safe RL) aims to train an RL agent to maximize its performance in real-world environments while adhering to safety constraints, as exceeding safety violation limits can result in severe consequences. In this paper, we propose a novel safe RL approach called Safety Modulated Policy Optimization (SMPO), which enables safe policy function learning within the standard policy optimization framework through safety modulated rewards. In particular, we consider safety violation costs as feedback from the RL environments that are parallel to the standard rewards, and introduce a Q-cost function as a safety critic to estimate expected future cumulative costs. We then propose to modulate the rewards using a cost-aware weighting function, which is carefully designed to enforce the safety limits based on the estimation of the safety critic, while maximizing the expected rewards. The policy function and the safety critic are simultaneously learned through gradient descent during online interactions with the environment. We conduct experiments using multiple RL environments, and the experimental results demonstrate that our method outperforms several classic and state-of-the-art comparison methods in terms of overall safe RL performance.
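To illustrate what "estimating expected future cumulative costs" can look like in practice, here is a tabular temporal-difference sketch of a Q-cost update. It is an assumption-laden simplification: the table, the SARSA-style target, the discount `gamma`, and the step size `lr` are illustrative choices, not the paper's neural critic or its training loss.

```python
from collections import defaultdict


def td_update_cost_critic(q_cost, s, a, c, s_next, a_next, done,
                          gamma=0.99, lr=0.1):
    """One TD(0) step on a tabular safety critic (hypothetical helper).

    q_cost maps (state, action) pairs to estimated discounted future cost.
    The target is the observed cost c plus the discounted estimate at the
    next state-action pair, or just c if the episode has ended.
    """
    bootstrap = 0.0 if done else gamma * q_cost[(s_next, a_next)]
    td_error = (c + bootstrap) - q_cost[(s, a)]
    q_cost[(s, a)] += lr * td_error
    return td_error


if __name__ == "__main__":
    q = defaultdict(float)  # unseen pairs start at zero estimated cost
    # A single unsafe transition with cost 1.0 that ends the episode.
    print(td_update_cost_critic(q, s=0, a=0, c=1.0, s_next=1, a_next=0, done=True))
    print(q[(0, 0)])
```

The update mirrors a standard reward critic with the cost signal in place of the reward, which is the sense in which costs are treated as feedback parallel to rewards.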

Paper Structure

This paper contains 23 sections, 14 equations, 6 figures, 1 algorithm.

Figures (6)

  • Figure 1: The main framework of our proposed SMPO approach. At each timestep $t$, transitions $\{(s_t,a_t, r_t,c_t)\}$ are collected through exploration employing the policy $\pi_\theta$. The observed cost $c_t$ contributes to the training of the safety critic $Q_\phi^{c}(s_t,a_t)$. Utilizing the safety critic, the cost-aware weighting function $f(Q_\phi^{c}(s_t,a_t))$ is applied to modulate the reward in a bilinear form $f(Q_\phi^{c}(s_t,a_t))\mathcal{R}(s_t,a_t)$ to facilitate safe policy learning.
  • Figure 2: Visualization of the cost-aware weighting function $f(Q_\phi^{c}(s_t,a_t))$ with a base value $b=3$. We utilize a fixed safety cost threshold of $d=25$, indicated by the vertical dashed grey line. Additionally, the horizontal dotted line in dark red illustrates the scenario where $f(Q_\phi^{c}(s_t,a_t))=0$ when the total estimated cost reaches the cost threshold $d$.
  • Figure 3: Comparison results in terms of average episode reward/cost vs. environment steps on the Point robot across three different tasks: Goal1, Button1, and Push1. The top row reports the results in terms of average episode rewards, while the bottom row reports the corresponding average episode costs. The results are averaged over three runs, with the shadow indicating standard deviations and the dashed black line indicating the cost threshold.
  • Figure 4: Comparison results in terms of average episode reward/cost vs. environment steps on the Point robot with Goal2 (a task with higher difficulty) and on the Car robot for two tasks, Goal1 and Goal2. The top row reports the results in terms of average episode rewards, while the bottom row reports the corresponding average episode costs. The results are averaged over three runs, with the shadow indicating standard deviations and the dashed black line indicating the cost threshold.
  • Figure 5: Results of ablation study for the proposed SMPO method on PointGoal1 and PointGoal2. SMPO is compared with three ablation variants: "-w/o Policy Gradient on $Q_\phi^c$", "-w/o Dynamic Schedule", and "-w/o Regularizer for $Q_\phi^c$ Learning". The results are averaged over three runs, with the shadow indicating standard deviations and the dashed black line indicating the cost threshold.
  • ...and 1 more figures
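
The captions above describe curves averaged over three runs with a band showing standard deviations. A minimal sketch of that aggregation is given below. The helper name `aggregate_runs` and the use of the population standard deviation are assumptions, since the captions do not say which estimator was used.

```python
from statistics import fmean, pstdev


def aggregate_runs(runs):
    """Return per-step (means, stds) across runs of equal length.

    runs is a list of curves, one per seed, where each curve is a list of
    average episode rewards (or costs) measured at the same environment steps.
    """
    if not runs or any(len(r) != len(runs[0]) for r in runs):
        raise ValueError("runs must be non-empty and of equal length")
    per_step = list(zip(*runs))  # regroup values by evaluation step
    means = [fmean(step) for step in per_step]
    stds = [pstdev(step) for step in per_step]  # population std (assumption)
    return means, stds


if __name__ == "__main__":
    # Three illustrative runs with two evaluation points each (made-up numbers).
    means, stds = aggregate_runs([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
    print(means, stds)
```

The mean gives the solid curve and `mean ± std` gives the band; a cost curve can then be compared against the threshold line at each step.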

Theorems & Definitions (1)

  • Definition 1