Second-Order Policy Gradient Methods for the Linear Quadratic Regulator

Amirreza Valaei, Arash Bahari Kordabad, Sadegh Soudjani

TL;DR

This work tackles slow convergence in policy gradient methods by deriving second-order, curvature-aware updates for the discounted LQR. It specializes the general policy-gradient Hessian framework to obtain a closed-form Gauss–Newton surrogate $H(\theta)$ and an explicit exact Hessian $\nabla_{\theta}^2 J(\theta) = H(\theta) + \gamma \Lambda(\theta)$, computable from the Lyapunov solution $P_\theta$ and state-covariance $\Sigma_\theta$ under mild assumptions. The authors demonstrate faster convergence and greater stability on a scalar LQR, an inverted pendulum, and a seismic-building model, with Newton updates achieving quadratic convergence and Gauss–Newton achieving superlinear rates, outperforming first-order policy gradient. This establishes practical, curvature-aware updates for a tractable RL setting and suggests promising extensions to model-free curvature estimation and robust control scenarios.

Abstract

Policy gradient methods are a powerful family of reinforcement learning algorithms for continuous control that optimize a policy directly. However, standard first-order methods often converge slowly. Second-order methods can accelerate learning by using curvature information, but they are typically expensive to compute. The linear quadratic regulator (LQR) is a practical setting in which key quantities, such as the policy gradient, admit closed-form expressions. In this work, we develop second-order policy gradient algorithms for LQR by deriving explicit formulas for both the approximate and exact Hessians used in Gauss–Newton and Newton methods, respectively. Numerical experiments show that the proposed second-order methods converge faster than the standard first-order policy gradient baseline.
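To make the first-order versus second-order contrast concrete, the sketch below runs both updates on a deterministic scalar discounted LQR. It is an illustration under stated assumptions, not the paper's matrix formulas: the problem data (`A`, `B`, `Q`, `R`, `GAMMA`, `S0`), the fixed step size, and the helper names are all hypothetical, and the cost uses the elementary closed form $J(k) = (q + r k^2)\, s_0^2 / (1 - \gamma (a - b k)^2)$, which holds for $u = -k s$ whenever $\gamma (a - b k)^2 < 1$. The Newton step here has no line search or safeguard, so it assumes an initial gain where the second derivative is positive.

```python
# Hypothetical scalar problem data (illustrative only, not taken from the paper).
A, B, Q, R, GAMMA, S0 = 1.2, 1.0, 1.0, 1.0, 0.9, 1.0


def cost_derivatives(k):
    """Return J(k), J'(k), J''(k) for the policy u = -k*s on s' = A*s + B*u."""
    c = A - B * k                      # closed-loop multiplier
    d = 1.0 - GAMMA * c * c            # must be > 0 for a finite discounted cost
    if d <= 0.0:
        raise ValueError("gain is not stabilizing for this discount factor")
    n = (Q + R * k * k) * S0 ** 2      # J = n / d
    n1, n2 = 2.0 * R * k * S0 ** 2, 2.0 * R * S0 ** 2   # n', n''
    d1, d2 = 2.0 * GAMMA * B * c, -2.0 * GAMMA * B * B  # d', d''
    j = n / d
    j1 = n1 / d - n * d1 / d ** 2
    j2 = (n2 / d - 2.0 * n1 * d1 / d ** 2
          - n * d2 / d ** 2 + 2.0 * n * d1 ** 2 / d ** 3)
    return j, j1, j2


def optimal_gain(iters=500):
    """Reference gain via fixed-point iteration on the scalar discounted Riccati equation."""
    p = Q
    for _ in range(iters):
        p = Q + GAMMA * A * A * p * R / (R + GAMMA * B * B * p)
    return GAMMA * A * B * p / (R + GAMMA * B * B * p)


def run(step_rule, k0=1.5, tol=1e-10, max_iter=500):
    """Iterate k <- k - step_rule(J', J'') until |J'| < tol; return (k, iterations)."""
    k = k0
    for i in range(max_iter):
        _, g, h = cost_derivatives(k)
        if abs(g) < tol:
            return k, i
        k -= step_rule(g, h)
    return k, max_iter


k_star = optimal_gain()
k_gd, it_gd = run(lambda g, h: 0.05 * g)   # first-order update, fixed step size (assumed)
k_nt, it_nt = run(lambda g, h: g / h)      # Newton update, rescaled by the curvature
print(f"k* = {k_star:.6f}")
print(f"gradient descent: k = {k_gd:.6f} after {it_gd} iterations")
print(f"Newton:           k = {k_nt:.6f} after {it_nt} iterations")
```

Both runs start from the same stabilizing gain and stop at the same gradient tolerance, so the iteration counts are directly comparable. In the matrix case the scalar division `g / h` would become a linear solve against the Hessian (or its Gauss–Newton approximation).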

Paper Structure

This paper contains 13 sections, 6 theorems, 72 equations, 2 figures.

Key Result

Lemma 1

For the linear system (eq:dynamics) with quadratic stage cost (eq:stage-cost), the value function $V_\theta$ and action-value function $Q_\theta$ corresponding to a $\gamma$-stabilizing linear policy $\pi_\theta(s) = -K s$ are obtained as follows: where

Figures (2)

  • Figure 1: Left: 3-D surface of the cost function $J(\theta)$ with optimization trajectories. Right: Heatmap of the same region. The Newton PG (green) proceeds directly toward the optimal parameters, whereas the first-order PG (red) fluctuates. Darker blue indicates lower cost.
  • Figure 2: The Frobenius-norm policy error, $\|K_k - K^\star\|_F$, versus iteration index $k$ for natural policy gradient, Gauss–Newton, and Newton. All algorithms are initialized at the same stabilizing gain $K_0$.

Theorems & Definitions (9)

  • Definition 1: Stabilization notions
  • Lemma 1
  • Theorem 1: Policy Gradient and Hessian
  • Lemma 2: Discounted State Correlation Matrix
  • Remark 1
  • Theorem 2: Exact Hessian in LQR
  • Lemma 3: Vanishing boundary flux
  • Remark 2
  • Lemma 4: Jacobian of $P_\theta$ w.r.t. $K$