Off Policy Lyapunov Stability in Reinforcement Learning

Sarvan Gill, Daniela Constantinescu

TL;DR

This work addresses the lack of stability guarantees in deep reinforcement learning for robotics by introducing a method to learn Lyapunov functions off-policy and integrate them into state-of-the-art algorithms. A neural Lyapunov function $L_\eta(s,a)$ is learned using off-policy data with a policy-dependent Lie derivative $\mathcal{L}_{f,\Delta t} L_\eta$, plus a minimum decay rate $\mu$, and its expectation over actions $L_\eta(s) = \mathbb{E}_{a \sim \pi} L_\eta(s,a)$ acts as a stability certificate. The approach yields two algorithms, LSAC and LPPO, which incorporate the Lyapunov term into SAC and PPO respectively to achieve improved sample efficiency and robust stability in both a pendulum and a quadrotor control setting, though it relies on simulated validation and introduces new hyperparameters. Overall, the framework advances data-efficient, stability-aware RL for robotics with potential for formal guarantees and broader applicability in real-world systems.
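The summary above does not state the training loss, so the following is only a minimal sketch of how a Lyapunov risk with a finite-difference Lie derivative and a minimum decay rate $\mu$ might be computed from stored transitions. The hinge form of the loss, the function names (`lie_derivative`, `lyapunov_risk`, `quadratic_candidate`), and the toy quadratic candidate are all assumptions made for illustration, not the authors' formulation. The sketch also assumes that the next action `a_next` is drawn from the current policy at `s_next`, which is one way the Lie derivative could be made policy-dependent while reusing off-policy data.

```python
from typing import Callable, Sequence, Tuple

State = Sequence[float]
Action = Sequence[float]
# One stored transition: (s, a, s_next, a_next).
# Assumption: a_next is sampled from the *current* policy at s_next.
Transition = Tuple[State, Action, State, Action]
Candidate = Callable[[State, Action], float]


def lie_derivative(L: Candidate, s: State, a: Action,
                   s_next: State, a_next: Action, dt: float) -> float:
    """Finite-difference estimate of the change of L over one time step."""
    return (L(s_next, a_next) - L(s, a)) / dt


def lyapunov_risk(L: Candidate, batch: Sequence[Transition],
                  dt: float, mu: float) -> float:
    """Mean hinge penalty over a batch (hypothetical loss form).

    Penalizes (i) negative candidate values and
    (ii) transitions where L decays more slowly than the rate mu * L.
    """
    total = 0.0
    for s, a, s_next, a_next in batch:
        value = L(s, a)
        decay = lie_derivative(L, s, a, s_next, a_next, dt)
        total += max(0.0, -value) + max(0.0, decay + mu * value)
    return total / len(batch)


def quadratic_candidate(s: State, a: Action) -> float:
    """Toy stand-in for the neural candidate L_eta(s, a)."""
    return sum(x * x for x in s) + 0.1 * sum(u * u for u in a)


# A transition that moves toward the origin and one that moves away from it.
contracting = [((1.0, 0.0), (0.0,), (0.5, 0.0), (0.0,))]
expanding = [((1.0, 0.0), (0.0,), (2.0, 0.0), (0.0,))]

risk_contracting = lyapunov_risk(quadratic_candidate, contracting, dt=0.5, mu=0.5)
risk_expanding = lyapunov_risk(quadratic_candidate, expanding, dt=0.5, mu=0.5)
```

Under these assumptions, the contracting transition incurs zero risk and the expanding one incurs a positive penalty. In an actual implementation the candidate would be a neural network and the risk would be minimized by gradient descent alongside the SAC or PPO objectives.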

Abstract

Traditional reinforcement learning cannot provide stability guarantees. More recent algorithms learn Lyapunov functions alongside the control policies to ensure stable learning. However, current self-learned Lyapunov functions are sample-inefficient because they are learned on-policy. This paper introduces a method for learning Lyapunov functions off-policy and incorporates the proposed off-policy Lyapunov function into the Soft Actor-Critic and Proximal Policy Optimization algorithms to provide them with a data-efficient stability certificate. Simulations of an inverted pendulum and a quadrotor illustrate the improved performance of the two algorithms when endowed with the proposed off-policy Lyapunov function.

Paper Structure

This paper contains 10 sections, 13 equations, 5 figures.

Figures (5)

  • Figure 1: The two proposed algorithms, LSAC (left) and LPPO (right). $J_{V_\psi}$, $J_{Q_\theta}$, and $\bar{\psi}$ are defined as in SAC, and $J_{V_\theta}$ is defined as in PPO.
  • Figure 2: Pendulum-v1 experiment results: (a) the reward of the different algorithms during training as a function of the number of episodes, with the shaded region showing one standard deviation over the 10 random seeds; (b) the loss of Eq. \ref{eq:lya-risk} and the reward during training (y-axis is normalized); (c) a sample trajectory for each algorithm after training is complete.
  • Figure 3: Level curves of the Lyapunov candidates learned by LSAC, POLYC and LAC. Grey dots represent pendulum states where the Lie derivative is negative. Red dots are pendulum states where the Lie derivative is positive.
  • Figure 4: The mean training rewards for LPPO, POLYC, and PPO on the MuJoCo Quadrotor environment, obtained from ten random seeds and plotted with a shaded region of one standard deviation.
  • Figure 5: Trajectory tracking for the quadrotor controlled by LPPO, POLYC, and PPO. The drone starts from the same initial position, $(x_0, y_0, z_0) \sim (1, 0, 2)$, for all three algorithms.