Neural Lyapunov and Optimal Control

Daniel Layeghi, Steve Tonneau, Michael Mistry

TL;DR

This paper introduces an optimal control theoretic, learning-based method that uses the Hamilton-Jacobi-Bellman (HJB) equation and first-order gradients to learn optimal time-varying value functions and, therefore, policies. The method robustly solves simple continuous control tasks on which untuned reinforcement learning converges poorly, using simple, parsimonious costs.

Abstract

Despite impressive results, reinforcement learning (RL) suffers from slow convergence and requires a large variety of tuning strategies. In this paper, we investigate the performance of RL algorithms on simple continuous control tasks. We show that, without reward and environment tuning, RL suffers from poor convergence. In turn, we introduce an optimal control (OC) theoretic, learning-based method that can solve the same problems robustly with simple, parsimonious costs. We use the Hamilton-Jacobi-Bellman (HJB) equation and first-order gradients to learn optimal time-varying value functions and, therefore, policies. We show that relaxing our objective yields time-varying Lyapunov functions, further verifying our approach by providing guarantees over a compact set of initial conditions. We compare our method to Soft Actor-Critic (SAC) and Proximal Policy Optimisation (PPO). In this comparison, we solve all tasks, we never underperform in task cost, and we show that, at the point of our convergence, we outperform SAC and PPO in the best case by 4 and 2 orders of magnitude, respectively.
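
For orientation, the standard finite-horizon HJB equation that the abstract refers to can be sketched as follows. This is the textbook form, not necessarily the exact formulation used in the paper. The symbols are assumptions introduced here for illustration: dynamics $\dot{x} = f(x,u)$, nonnegative running cost $\ell(x,u)$, terminal cost $\ell_T(x)$, and horizon $T$.

```latex
% Textbook finite-horizon HJB equation (assumed notation, not taken from the paper)
\begin{align}
  -\frac{\partial V(x,t)}{\partial t}
    &= \min_{u}\Big[\, \ell(x,u) + \nabla_x V(x,t)^{\top} f(x,u) \Big],
    \qquad V(x,T) = \ell_T(x), \\
  % The policy follows from the value function:
  u^{*}(x,t)
    &= \arg\min_{u}\Big[\, \ell(x,u) + \nabla_x V(x,t)^{\top} f(x,u) \Big].
\end{align}

% Along the closed-loop trajectory the HJB equality gives
\begin{equation}
  \frac{\mathrm{d}V}{\mathrm{d}t}
    = \frac{\partial V}{\partial t} + \nabla_x V^{\top} f(x,u^{*})
    = -\ell(x,u^{*}) \le 0 .
\end{equation}
```

Under these assumptions, replacing the equality in the last line with the inequality $\mathrm{d}V/\mathrm{d}t \le 0$ gives a Lyapunov-style decrease condition on a time-varying $V$. This is one common way to read the relaxation to time-varying Lyapunov functions described in the abstract.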

Paper Structure (22 sections, 13 equations, 3 figures, 1 table)

Figures (3)

  • Figure 1: Compact stability region for double integrator, computed by Neural Lyapunov Control.
  • Figure 2: Top row: Constraint satisfaction loss for value and Lyapunov function constraints. Middle row: Trajectory cost using our method. Bottom row: SAC and PPO trajectory cost. Due to high values, SAC costs are scaled for visualisation.
  • Figure 3: Cartpole balancing Lyapunov trajectories.