Lyapunov-based Safe Policy Optimization for Continuous Control

Yinlam Chow, Ofir Nachum, Aleksandra Faust, Edgar Duenez-Guzman, Mohammad Ghavamzadeh

TL;DR

This work addresses safety in continuous-action reinforcement learning by casting it as a CMDP and introducing Lyapunov-based safe policy gradient methods. It develops two complementary approaches: (i) θ-projection, which optimizes a surrogate objective under Lyapunov-derived linearized constraints, and (ii) a-projection, which enforces safety via a Lyapunov safety layer that projects actions into the feasible set. Both methods are compatible with on-policy and off-policy PG algorithms, such as PPO and DDPG, and guarantee near-constraint satisfaction at every policy update. Because they can use both on-policy and off-policy data, they are more data efficient than existing constrained PG methods. Empirical validation on MuJoCo benchmarks and a real indoor robot navigation task shows stable learning, effective constraint satisfaction, and applicability to real-world safety-critical robotics. Potential extensions include tighter Lyapunov constructions and model-based setups.

Abstract

We study continuous action reinforcement learning problems in which it is crucial that the agent interacts with the environment only through safe policies, i.e., policies that do not take the agent to undesirable situations. We formulate these problems as constrained Markov decision processes (CMDPs) and present safe policy optimization algorithms that are based on a Lyapunov approach to solve them. Our algorithms can use any standard policy gradient (PG) method, such as deep deterministic policy gradient (DDPG) or proximal policy optimization (PPO), to train a neural network policy, while guaranteeing near-constraint satisfaction for every policy update by projecting either the policy parameter or the action onto the set of feasible solutions induced by the state-dependent linearized Lyapunov constraints. Compared to the existing constrained PG algorithms, ours are more data efficient as they are able to utilize both on-policy and off-policy data. Moreover, our action-projection algorithm often leads to less conservative policy updates and allows for natural integration into an end-to-end PG training pipeline. We evaluate our algorithms and compare them with the state-of-the-art baselines on several simulated (MuJoCo) tasks, as well as a real-world indoor robot navigation problem, demonstrating their effectiveness in terms of balancing performance and constraint satisfaction. Videos of the experiments can be found at the following link: https://drive.google.com/file/d/1pzuzFqWIE710bE2U6DmS59AfRzqK2Kek/view?usp=sharing.

Paper Structure

This paper contains 18 sections, 1 theorem, 28 equations, 9 figures, 5 algorithms.

Key Result

Proposition 1

At any given state $x\in\mathcal{X}$, the solution to the safety-layer optimization problem (opt:safety_layer) has the form $\pi_{\Xi(\pi_B,\theta)}(x)=\pi_{\theta}(x)+\lambda^*(x) \cdot g_{L_{\pi_B}}(x)$, where
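
A minimal sketch of a projection with this additive form, under assumptions that are not taken from the text above: the safety layer is taken to be a Euclidean projection of $\pi_\theta(x)$ onto a single half-space $\{a : g^\top (a - \pi_B(x)) \le \epsilon\}$, and the function name, argument names, and the threshold `eps` are all hypothetical. For a half-space, the standard closed form is $a = \pi_\theta(x) + \lambda g$ with $\lambda = -\max(0,\ g^\top(\pi_\theta(x) - \pi_B(x)) - \epsilon) / (g^\top g)$, which is what the code computes.

```python
from typing import List, Sequence


def project_action(
    pi_theta: Sequence[float],
    pi_base: Sequence[float],
    g: Sequence[float],
    eps: float,
) -> List[float]:
    """Euclidean projection of pi_theta onto {a : g.(a - pi_base) <= eps}.

    All names are hypothetical; the half-space form of the constraint is
    an assumption made for this sketch, not a statement of the paper's method.
    """
    g_sq = sum(gi * gi for gi in g)
    if g_sq == 0.0:
        # Degenerate constraint: there is no direction to project along.
        return list(pi_theta)
    # Amount by which the unprojected action violates the constraint.
    violation = sum(gi * (a - b) for gi, a, b in zip(g, pi_theta, pi_base)) - eps
    # Multiplier is zero when the constraint already holds, negative otherwise.
    lam = -max(0.0, violation) / g_sq
    return [a + lam * gi for a, gi in zip(pi_theta, g)]


# Infeasible action: moved back along g until the constraint is tight.
projected = project_action([3.0, 4.0], [0.0, 0.0], [1.0, 2.0], eps=1.0)
# Feasible action: returned unchanged.
unchanged = project_action([0.2, 0.3], [0.0, 0.0], [1.0, 0.0], eps=0.5)
```

Because the correction is a closed-form expression in $\pi_\theta(x)$, a layer of this kind can be appended to a policy network without an inner numerical solver, which is one way to read the abstract's remark about end-to-end PG training.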

Figures (9)

  • Figure 1: DDPG (red), DDPG-Lagrangian (cyan), SDDPG (blue), SDDPG $a$-projection (green) on HalfCheetah-Safe and Point-Gather. Ours (SDDPG, SDDPG $a$-projection) learn stably and safely, even though the dynamics and cost functions are not known, control actions are continuous, and deep function approximations are necessary. The x-axis is in thousands of episodes. Shaded areas represent the $1$-SD confidence intervals (over $10$ random seeds). The dashed purple line represents the constraint limit.
  • Figure 2: PPO (red), PPO-Lagrangian (cyan), SPPO (blue), SPPO $a$-projection (green) on HalfCheetah-Safe and Point-Gather. Ours (SPPO, SPPO $a$-projection) learn stably and safely, even though the dynamics and cost functions are not known, control actions are continuous, and deep function approximations are necessary.
  • Figure 3: Robot navigation task details.
  • Figure 4: DDPG (red), DDPG-Lagrangian (cyan), SDDPG (blue), SDDPG $a$-projection (green) on Robot Navigation. Ours (SDDPG, SDDPG $a$-projection) balance reward and constraint learning. The x-axis is in thousands of steps. The shaded areas represent the $1$-SD confidence intervals (over $50$ runs). The dashed purple line represents the constraint limit.
  • Figure 5: Navigation routes of two policies on a similar setup (a) and (b). Log of on-robot experiments (c). A larger version is in the appendix, and the video is available in the supplementary materials.
  • ...and 4 more figures

Theorems & Definitions (1)

  • Proposition 1