Global Convergence of Natural Policy Gradient with Hessian-aided Momentum Variance Reduction

Jie Feng, Ke Wei, Jinchi Chen

TL;DR

This paper introduces NPG-HM, a single-loop natural policy gradient method that uses Hessian-aided momentum variance reduction to improve sample efficiency in reinforcement learning. The method avoids importance sampling and solves the update subproblem with SGD. The authors prove global last-iterate convergence with a sample complexity of $\mathcal{O}(\varepsilon^{-2})$ under Fisher-non-degenerate policies, using a relaxed weak gradient dominance property and a novel error decomposition. MuJoCo-based experiments complement the theory, showing that NPG-HM outperforms several state-of-the-art policy gradient methods on continuous-control tasks.

Abstract

Natural policy gradient (NPG) and its variants are widely used policy search methods in reinforcement learning. Inspired by prior work, a new NPG variant coined NPG-HM is developed in this paper, which utilizes the Hessian-aided momentum technique for variance reduction, while the sub-problem is solved via the stochastic gradient descent method. It is shown that NPG-HM can achieve global last-iterate $\varepsilon$-optimality with a sample complexity of $\mathcal{O}(\varepsilon^{-2})$, which is the best known result for natural policy gradient type methods under the generic Fisher non-degenerate policy parameterizations. The convergence analysis is built upon a relaxed weak gradient dominance property tailored for NPG under the compatible function approximation framework, as well as a neat way to decompose the error when handling the sub-problem. Moreover, numerical experiments on MuJoCo-based environments demonstrate the superior performance of NPG-HM over other state-of-the-art policy gradient methods.
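
To give a feel for the Hessian-aided momentum idea mentioned in the abstract, the following is a minimal sketch on a toy quadratic objective, not the paper's algorithm. It assumes the generic recursion from the variance-reduction literature, $u_t = \beta_t g_t + (1-\beta_t)\,(u_{t-1} + \hat{H}_t(\theta_t - \theta_{t-1}))$, where $g_t$ is a stochastic gradient and $\hat{H}_t$ a stochastic Hessian evaluated at a random point between $\theta_{t-1}$ and $\theta_t$. The exact estimator used by NPG-HM is not reproduced on this page, and the sketch omits the Fisher-preconditioned (natural gradient) step entirely. The function name, the objective, the noise levels and the step sizes are all hypothetical choices made for illustration.

```python
import numpy as np

def hessian_aided_momentum_demo(steps=2000, seed=0):
    """Minimise 0.5*theta^T A theta - b^T theta from noisy oracles (toy setup)."""
    rng = np.random.default_rng(seed)
    A = np.diag([1.0, 2.0, 4.0])          # hypothetical curvature
    b = np.array([1.0, -2.0, 0.5])
    theta_star = np.linalg.solve(A, b)    # exact minimiser, used only for checking

    def noisy_grad(theta):
        # Unbiased stochastic gradient: true gradient plus Gaussian noise.
        return A @ theta - b + 0.5 * rng.standard_normal(3)

    def noisy_hvp(theta_mid, v):
        # Unbiased stochastic Hessian-vector product. For a quadratic the
        # Hessian does not depend on theta_mid; the argument is kept only to
        # mirror the shape of the general recursion.
        noise = 0.1 * rng.standard_normal((3, 3))
        return (A + 0.5 * (noise + noise.T)) @ v

    tau0 = 10.0                           # hypothetical momentum offset
    theta_prev = np.zeros(3)
    u = noisy_grad(theta_prev)            # initial gradient estimate
    theta = theta_prev - 0.1 * u
    for t in range(1, steps + 1):
        beta = tau0 / (t + tau0)          # decaying momentum weight
        alpha = 0.2 * np.sqrt(beta)       # step size proportional to sqrt(beta)
        q = rng.uniform()                 # random point on the segment
        theta_mid = q * theta + (1.0 - q) * theta_prev
        correction = noisy_hvp(theta_mid, theta - theta_prev)
        # Fresh gradient mixed with the previous estimate, which is first
        # transported to the new iterate by the Hessian correction.
        u = beta * noisy_grad(theta) + (1.0 - beta) * (u + correction)
        theta_prev, theta = theta, theta - alpha * u
    return theta, theta_star

theta, theta_star = hessian_aided_momentum_demo()
print("distance to minimiser:", np.linalg.norm(theta - theta_star))
```

The Hessian correction moves the old estimate from $\theta_{t-1}$ to $\theta_t$ without reweighting past samples, which is how this family of estimators avoids importance sampling.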

Paper Structure

This paper contains 28 sections, 11 theorems, 83 equations, 1 figure, 2 tables, and 2 algorithms.

Key Result

Theorem 3.1

Suppose $H = -\frac{1}{\log \gamma} \log(T+\tau_0), \beta_t = \frac{\tau_0}{t+\tau_0}, \alpha_t = \alpha_0 \beta_t^{1/2}, \lambda_t = \lambda_0 \beta_t^{-1/2}, \lambda_0 = \frac{\kappa \tau_0 \alpha_0}{4\mu_F}$ and $\alpha_0= \sqrt{\frac{\mu_F^2}{\kappa\tau_0(12L^2 + 6\nu_h^2)}}$, where $t\geq 1$ a
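
The parameter choices in the visible part of Theorem 3.1 can be evaluated directly. The sketch below simply transcribes those formulas. The function name `npg_hm_schedule` and the numeric values in the usage lines are hypothetical, and any conditions on the constants that appear in the truncated remainder of the theorem are not checked here.

```python
import math

def npg_hm_schedule(T, gamma, tau0, kappa, mu_F, L, nu_h):
    """Evaluate the parameter formulas stated in Theorem 3.1 for t = 1..T."""
    # alpha_0 = sqrt(mu_F^2 / (kappa * tau_0 * (12 L^2 + 6 nu_h^2)))
    alpha0 = math.sqrt(mu_F**2 / (kappa * tau0 * (12 * L**2 + 6 * nu_h**2)))
    # lambda_0 = kappa * tau_0 * alpha_0 / (4 mu_F)
    lambda0 = kappa * tau0 * alpha0 / (4 * mu_F)
    # H = -log(T + tau_0) / log(gamma); returned as a real number, and any
    # rounding to an integer is left to the caller.
    H = -math.log(T + tau0) / math.log(gamma)
    rows = []
    for t in range(1, T + 1):
        beta_t = tau0 / (t + tau0)
        alpha_t = alpha0 * math.sqrt(beta_t)     # alpha_t = alpha_0 * beta_t^{1/2}
        lambda_t = lambda0 / math.sqrt(beta_t)   # lambda_t = lambda_0 * beta_t^{-1/2}
        rows.append((t, beta_t, alpha_t, lambda_t))
    return H, alpha0, lambda0, rows

# Hypothetical constants, chosen only to exercise the formulas.
H, alpha0, lambda0, rows = npg_hm_schedule(
    T=100, gamma=0.99, tau0=4.0, kappa=1.0, mu_F=1.0, L=1.0, nu_h=1.0
)
print(H, alpha0, lambda0, rows[0], rows[-1])
```

Two consequences follow from the formulas alone: the product $\alpha_t \lambda_t = \alpha_0 \lambda_0$ is constant in $t$, and $\gamma^{H} = 1/(T+\tau_0)$.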

Figures (1)

  • Figure 1: Empirical comparison of NPG-HM and other policy gradient methods on six environments.

Theorems & Definitions (20)

  • Definition 3.1
  • Remark 3.1
  • Remark 3.2
  • Theorem 3.1
  • Remark 3.3
  • Lemma 4.1
  • Remark 4.1
  • Lemma 4.2
  • Remark 4.2
  • Lemma 4.3
  • ...and 10 more