Constrained Reinforcement Learning with Smoothed Log Barrier Function

Baohe Zhang, Yuan Zhang, Lilli Frison, Thomas Brox, Joschka Bödecker

TL;DR

A new constrained RL method called CSAC-LB (Constrained Soft Actor-Critic with Log Barrier Function), which achieves competitive performance without any pre-training by applying a linear smoothed log barrier function to an additional safety critic.

Abstract

Reinforcement Learning (RL) has been widely applied to control tasks and has substantially improved performance over conventional control methods in many domains where the reward function is well defined. However, for many real-world problems, it is often more convenient to formulate the optimization problem in terms of rewards and constraints simultaneously. Optimizing such constrained problems via reward shaping can be difficult, as it requires tedious manual tuning of reward functions with several interacting terms. Recent formulations that include constraints mostly require a pre-training phase, which often needs human expertise to collect data or assumes that a sub-optimal policy is readily available. We propose a new constrained RL method called CSAC-LB (Constrained Soft Actor-Critic with Log Barrier Function), which achieves competitive performance without any pre-training by applying a linear smoothed log barrier function to an additional safety critic. It implements an adaptive penalty for policy learning and alleviates the numerical issues that are known to complicate the application of the log barrier function method. As a result, we show that CSAC-LB achieves state-of-the-art performance on several constrained control tasks with different levels of difficulty, and we evaluate our method on a locomotion task on a real quadruped robot platform.

Paper Structure (19 sections, 1 theorem, 18 equations, 6 figures, 1 table, 1 algorithm)

This paper contains 19 sections, 1 theorem, 18 equations, 6 figures, 1 table, 1 algorithm.

Key Result

Proposition 1

The maximum gap between the optimal value of the constrained problem (Eq. eq:ori_cons_problem) and the optimal value obtained by minimizing the unconstrained objective defined in Eq. eq:csac_loss is bounded by $m/\mu$ when $\mu > 1$, where $m$ is the number of constraints and $\mu$ is the log barrier factor.
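
For reference, bounds of this kind are usually derived from the classical log-barrier argument in convex optimization, which yields a gap of $m/\mu$. The sketch below is that textbook argument, not the paper's own proof: the generic objective $f_0$, constraints $f_i(x) \le 0$, optimal value $p^\star$, and dual function $g$ are notation assumed here for illustration, and the argument relies on convexity, which does not hold in general for RL objectives.

```latex
\begin{align}
x^\star_\mu &= \arg\min_x \; f_0(x) - \frac{1}{\mu} \sum_{i=1}^{m} \log\bigl(-f_i(x)\bigr), \\
0 &= \nabla f_0(x^\star_\mu) + \sum_{i=1}^{m} \lambda_i \nabla f_i(x^\star_\mu),
  \qquad \lambda_i = -\frac{1}{\mu f_i(x^\star_\mu)} > 0, \\
p^\star \;\ge\; g(\lambda) &= f_0(x^\star_\mu) + \sum_{i=1}^{m} \lambda_i f_i(x^\star_\mu)
  = f_0(x^\star_\mu) - \frac{m}{\mu}, \\
0 \;\le\; f_0(x^\star_\mu) - p^\star &\le \frac{m}{\mu}.
\end{align}
```

Each of the $m$ constraints contributes exactly $\lambda_i f_i(x^\star_\mu) = -1/\mu$, which is why the gap scales linearly with the number of constraints and shrinks as the log barrier factor grows.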

Figures (6)

  • Figure 1: Upper Left: Safety Gym [ray2019benchmarking] PointGoal1-v0 task. The red dot is the controlled agent. The agent is required to reach the goal area (green) and avoid passing through randomly generated blue pillars. Lower Left: Unitree A1 [unitree2018unitree] in the MuJoCo [todorov2012mujoco] simulator with randomized terrain. Right: Unitree A1 with a protective shield, from [Wu22CoRL_DayDreamer].
  • Figure 2: Left: Log barrier function with different $\mu$. The dashed line is the indicator function. When $x \rightarrow 0$, $\psi (x) \rightarrow \infty$, which leads to an infinite penalty. When $\mu \rightarrow \infty$, the log barrier function $\psi (x)$ approaches a step function. Mid: Linear smoothed log barrier function with different $\mu$. The dashed line is the indicator function. When $\mu \rightarrow \infty$, the linear smoothed log barrier function $\tilde{\psi} (x)$ approaches a step function with an infinite penalty when $x > 0$. Right: Log barrier function (blue curve) with value clipping. Because the value is clipped at $y = 10$, the gradient vanishes.
  • Figure 3: The mean and standard deviation of the episodic return and cost in the PointGoal1-v0 environment. Each baseline is trained with 5 seeds for 3e6 environment steps and is evaluated every 2e4 training steps. CSAC-LB achieves the best performance without the performance degradation over time seen for SAC-Lag.
  • Figure 4: The mean and standard deviation of the episodic return (Upper) and cost (Lower) in locomotion tasks. All agents are trained with 5 seeds for 1e6 training steps and evaluated every 4e3 steps. Only CSAC-LB (green) learns to solve both tasks. WCSAC stops training at the dashed line in the right-hand plots due to numerical issues during training. SAC only learns to lose balance and terminate the episode, constantly violating the constraints.
  • Figure 5: Two gaits learned by CSAC-LB when different speeds are specified as constraints. When a low speed is desired, the gait is close to walking, as each of its legs leaves the ground sequentially. When a higher speed is desired, the robot starts to use both of its feet to accelerate.
  • ...and 1 more figures
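
The behaviour described in the Figure 2 caption can be illustrated with a minimal Python sketch. The exact definition of $\tilde{\psi}$ is not reproduced in this excerpt, so the code assumes the commonly used log-barrier extension: the log barrier is kept for $x \le -1/\mu^2$ and continued as its tangent line beyond that point. The function names and the switching threshold are assumptions made for illustration, not the authors' implementation.

```python
import math


def log_barrier(x: float, mu: float) -> float:
    """Standard log barrier -(1/mu) * log(-x) for the constraint x <= 0.

    The penalty is infinite once the constraint is violated (x >= 0).
    """
    if x >= 0.0:
        return math.inf
    return -math.log(-x) / mu


def smoothed_log_barrier(x: float, mu: float) -> float:
    """Linear smoothed log barrier (assumed log-barrier-extension form).

    For x <= -1/mu^2 this equals the standard log barrier. Beyond that
    point the function continues as the tangent line with slope mu, so the
    value and gradient stay finite even when the constraint is violated.
    """
    threshold = -1.0 / mu**2
    if x <= threshold:
        return -math.log(-x) / mu
    return mu * x - math.log(1.0 / mu**2) / mu + 1.0 / mu


if __name__ == "__main__":
    mu = 2.0
    for x in (-1.0, -0.25, -0.1, 0.0, 0.5):
        print(x, log_barrier(x, mu), smoothed_log_barrier(x, mu))
```

Under this assumed form, the two branches agree in value and slope at $x = -1/\mu^2$, so the penalty is continuously differentiable. In the violated region the slope is $\mu$, which gives a finite but steep penalty in place of the infinite value or the vanishing gradient of a clipped barrier.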

Theorems & Definitions (1)

  • Proposition 1