Quasi-Newton Compatible Actor-Critic for Deterministic Policies

Arash Bahari Kordabad, Dean Brandner, Sebastien Gros, Sergio Lucia, Sadegh Soudjani

TL;DR

The paper tackles slow convergence in deterministic policy gradient methods by introducing a quasi-Newton actor–critic framework that exploits curvature information through a compatible quadratic critic. A batch least-squares temporal-difference (LSTD) procedure learns the critic parameters $(v,g,W)$ so that the critic preserves both the true gradient $∇_θ J(θ)$ and a Hessian approximation $H(θ)$, enabling the Newton-like update $θ_{i+1}=θ_i-α_θ H(θ_i)^{-1} ∇_θ J(θ_i)$. The critic takes the form $Q^w(s,a)=A^w(s,a)+V^v(s)$, with advantage $A^w(s,a)=ψ(s,a)^\top W ψ(s,a)+ψ(s,a)^\top g$ and features $ψ(s,a)=∇_θπ_θ(s)(a-π_θ(s))$; compatibility is ensured via a PSD curvature matrix $W$ and a gradient term $g$. Empirical results on an LQR problem and a cart–pendulum balancing task show faster convergence and better sample efficiency than first-order baselines. The approach applies to any differentiable policy class, including MPC- and neural-network-based policies. The method thus provides a principled way to incorporate second-order information into deterministic policy learning for control problems where data efficiency and stability are crucial.
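To make the notation above concrete, the following is a minimal NumPy sketch of the critic form and the Newton-like step as stated in the TL;DR. It is not the authors' implementation. The function names, the array shapes, and the optional `damping` term are assumptions introduced here for illustration; the gradient and the Hessian approximation are taken as given inputs, since the summary does not spell out how they are assembled from $(g,W)$.

```python
import numpy as np

def compatible_features(dpi_dtheta, a, pi_s):
    """psi(s,a) = grad_theta pi_theta(s) (a - pi_theta(s)).

    dpi_dtheta: (n_theta, n_a) Jacobian of the policy w.r.t. theta (assumed layout).
    a, pi_s:    (n_a,) applied action and policy action at state s.
    """
    return dpi_dtheta @ (a - pi_s)

def quadratic_critic(psi, W, g, v_s):
    """Q^w(s,a) = psi^T W psi + psi^T g + V^v(s), with v_s = V^v(s) given as a scalar."""
    return psi @ W @ psi + psi @ g + v_s

def quasi_newton_step(theta, grad, H, alpha=1.0, damping=0.0):
    """theta_{i+1} = theta_i - alpha * H^{-1} grad.

    `damping` adds damping * I to H before solving; this safeguard is an
    assumption of this sketch, not something stated in the source.
    """
    H_reg = H + damping * np.eye(H.shape[0])
    # Solve the linear system instead of forming the inverse explicitly.
    return theta - alpha * np.linalg.solve(H_reg, grad)
```

Two properties follow directly from the definitions: when the applied action equals the policy action, $ψ(s,a)=0$ and the critic reduces to $V^v(s)$; and for an exactly quadratic performance with $H$ equal to its Hessian, a single step with $α_θ=1$ lands on the minimizer.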

Abstract

In this paper, we propose a second-order deterministic actor-critic framework in reinforcement learning that extends the classical deterministic policy gradient method to exploit curvature information of the performance function. Building on the concept of compatible function approximation for the critic, we introduce a quadratic critic that simultaneously preserves the true policy gradient and an approximation of the performance Hessian. A least-squares temporal difference learning scheme is then developed to estimate the quadratic critic parameters efficiently. This construction enables a quasi-Newton actor update using information learned by the critic, yielding faster convergence compared to first-order methods. The proposed approach is general and applicable to any differentiable policy class. Numerical examples demonstrate that the method achieves improved convergence and performance over standard deterministic actor-critic baselines.
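The least-squares temporal-difference step mentioned in the abstract can be illustrated with a generic batch LSTD solve. The sketch below is written under stated assumptions and is not the paper's exact scheme: it assumes a linear value function $V^v(s)=φ(s)^\top v$ with hypothetical state features `phi`, a symmetric $W$, and next actions taken from the deterministic policy, so that $ψ$ vanishes at the next state and the bootstrap target reduces to $γ V^v(s')$. It does not enforce that $W$ is PSD.

```python
import numpy as np

def quad_features(psi):
    """Unique entries of psi psi^T (for a symmetric W), followed by psi itself."""
    iu = np.triu_indices(psi.size)
    outer = np.outer(psi, psi)[iu]
    # Off-diagonal entries of a symmetric W appear twice in psi^T W psi.
    scale = np.where(iu[0] == iu[1], 1.0, 2.0)
    return np.concatenate([scale * outer, psi])

def lstd_quadratic_critic(psi, phi, phi_next, rewards, gamma):
    """Batch LSTD for Q(s,a) = psi^T W psi + psi^T g + phi(s)^T v.

    psi:      (N, n) features psi(s_k, a_k)
    phi:      (N, m) value features at s_k (hypothetical choice)
    phi_next: (N, m) value features at s_{k+1}
    rewards:  (N,)   stage rewards or costs
    """
    n, m = psi.shape[1], phi.shape[1]
    x = np.hstack([np.array([quad_features(p) for p in psi]), phi])
    # At the next state the policy action is applied, so psi = 0 there and
    # only the value features contribute to the bootstrap term.
    x_next = np.hstack([np.zeros((len(psi), x.shape[1] - m)), phi_next])
    A = x.T @ (x - gamma * x_next)
    b = x.T @ rewards
    w = np.linalg.lstsq(A, b, rcond=None)[0]
    # Unpack the stacked solution into (W, g, v).
    n_quad = n * (n + 1) // 2
    W = np.zeros((n, n))
    W[np.triu_indices(n)] = w[:n_quad]
    W = W + W.T - np.diag(np.diag(W))
    g = w[n_quad:n_quad + n]
    v = w[n_quad + n:]
    return W, g, v
```

Because the critic is linear in the stacked parameters, a single linear solve over the batch suffices. Using only the upper triangle of $ψψ^\top$ avoids the redundant columns that a full vectorization of a symmetric $W$ would introduce.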

Paper Structure

This paper contains 13 sections, 2 theorems, 31 equations, 9 figures, 1 algorithm.

Key Result

Theorem 1

For a differentiable deterministic policy $\boldsymbol{\mathrm{\pi}}_{\boldsymbol{\mathrm{\theta}}}(\boldsymbol{\mathrm{s}})$ and twice continuously differentiable action-value function $Q^{\boldsymbol{\mathrm{\pi}}_{\boldsymbol{\mathrm{\theta}}}} (\boldsymbol{\mathrm{s}},\boldsymbol{\mathrm{a}})$, This approximation captures the exact Hessian at the optimal policy parameters, that is, $H(\boldsy

Figures (9)

  • Figure 1: State evolution in the first step (blue) vs. the last step (orange) under the quasi-Newton actor–critic.
  • Figure 2: Control actions in the first step (blue) vs. the last step (orange) under the quasi-Newton actor–critic.
  • Figure 3: Norm of the performance gradient across policy updates: quasi-Newton (blue) vs. first-order DPG (red). The two methods observe comparable gradient magnitudes.
  • Figure 4: Parameters over updates: quasi-Newton (blue) vs. first-order DPG (red). Curvature information yields faster convergence.
  • Figure 5: Performance $J(\boldsymbol{\mathrm{\theta}})$ vs. policy-update index: quasi-Newton (blue) achieves a faster decrease and reaches a lower value earlier than first-order DPG (red).
  • ...and 4 more figures

Theorems & Definitions (3)

  • Theorem 1
  • Definition 1
  • Theorem 2