Safe Deep Model-Based Reinforcement Learning with Lyapunov Functions

Harry Zhang

TL;DR

This work tackles safety and stability in deep model-based reinforcement learning with unknown dynamics by integrating a Lyapunov-constrained value function into the Safety Augmented Value Estimation from Demonstrations (SAVED) framework, yielding SALVED. The approach learns a Lyapunov neural network to produce a stabilizing terminal cost within a Learning MPC setting, enforcing both safety during exploration and a monotone decrease of the Lyapunov function along trajectories. Empirical results in simulated 4D navigation tasks show SALVED improves stability, task completion, and constraint satisfaction while maintaining sample efficiency, with trajectories that exhibit reduced variance and fewer local minima. The framework offers a practical path toward safer, more reliable deep MBRL for control under unknown dynamics, with potential extensions to physical robots and stronger stability guarantees such as asymptotic stability.

Abstract

Model-based reinforcement learning (MBRL) has shown many desirable properties for intelligent control tasks. However, satisfying safety and stability constraints during training and rollout remains an open question. We propose a new MBRL framework, built on the learning model predictive control (LMPC) framework, that enables efficient policy learning under unknown dynamics with mathematically provable stability guarantees. We introduce and explore a novel method for adding safety constraints to model-based RL during training and policy learning. The new stability-augmented framework consists of a neural-network-based learner that learns to construct a Lyapunov function and a model-based RL agent that consistently completes tasks while satisfying user-specified constraints, given only suboptimal demonstrations and sparse-cost feedback. We demonstrate the capability of the proposed framework through simulated experiments.
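
The stability requirement summarized above, a Lyapunov function whose value decreases along trajectories, can be illustrated with a small check. The sketch below is not the paper's implementation: `quadratic_lyapunov` is a hypothetical stand-in for the learned Lyapunov network, `decreases_along` and its `margin` parameter are illustrative names, and the sample trajectories are made up.

```python
from typing import Callable, Sequence

State = Sequence[float]


def quadratic_lyapunov(state: State) -> float:
    """Hypothetical stand-in V(x) = ||x||^2 for a learned Lyapunov network.

    Zero at the origin (the goal) and positive everywhere else.
    """
    return sum(s * s for s in state)


def decreases_along(trajectory: Sequence[State],
                    V: Callable[[State], float],
                    margin: float = 0.0) -> bool:
    """Return True if V(x_{t+1}) - V(x_t) < -margin at every step."""
    values = [V(x) for x in trajectory]
    return all(nxt - cur < -margin for cur, nxt in zip(values, values[1:]))


# Illustrative 2D positions only; not data from the paper.
converging = [(-100.0, 0.0), (-50.0, 0.0), (-25.0, 0.0), (0.0, 0.0)]
stalled = [(-100.0, 0.0), (-90.0, 5.0), (-95.0, 5.0)]

print(decreases_along(converging, quadratic_lyapunov))  # True
print(decreases_along(stalled, quadratic_lyapunov))     # False
```

The second trajectory fails because its last step moves away from the goal, so the candidate function increases. In the framework described here, a learned network would take the place of the fixed quadratic, and the decrease condition would be enforced during planning rather than checked after the fact.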

Paper Structure

This paper contains 14 sections, 11 equations, and 4 figures.

Figures (4)

  • Figure 1: Planning trajectories of the MPC policy in SAVED in the Pointbot 1 (point mass navigation) environment, where the heat map shows the predicted cost-to-go (negative return) according to the value function at each iteration. In the plots, the $x$-axis spans $[-130, -80]$ and the $y$-axis spans $[-10, 10]$. The starting position $x_0$ is $(-100, 0)$ and the goal is at the origin, $(0, 0)$. Even in an environment with no obstacles or constraints, SAVED struggles to complete the task due to local minima.
  • Figure 2: Pointbot Navigation Tasks: SALVED is evaluated on four Pointbot navigation tasks. Pointbot 1 has no obstacles or safety constraints; Pointbot 2-4 contain obstacles with increasingly complex constraints. SALVED has a lower iteration cost than the baselines on all four tasks and is considerably more stable during training and evaluation.
  • Figure 3: Evaluation trajectories combined across 100 iterations. The left plots show that the control policies learned by SAVED are considerably further from optimal and have lower task completion rates across the experiments. The right plots show higher task completion rates and suggest that the policies learned by SALVED are considerably closer to optimal and more consistent, as the resulting trajectories are also denser.