On the Second-Order Convergence of Biased Policy Gradient Algorithms

Siqiao Mu, Diego Klabjan

TL;DR

This work analyzes second-order convergence of biased policy gradient methods for reinforcement learning with infinite-horizon discounted objectives. It develops a general framework showing that biased gradient estimators, including vanilla GPOMDP and double-loop actor-critic with TD(0) critics, can escape saddle points provided the bias and noise are sufficiently controlled. The authors establish finite-time guarantees: vanilla policy gradient achieves an $\epsilon$-second-order stationary point in $\tilde{O}(\epsilon^{-6.5})$ iterations with horizon $H=O(\log(1/\epsilon^2))$, and the actor-critic variant attains the same outer rate with inner TD(0) steps on the order of $\tilde{O}(\epsilon^{-8})$. They also prove a finite-time TD(0) convergence result on nonstationary Markov chains, enabling realistic analysis of actor-critic methods. These results advance understanding of how bias in gradient estimators affects saddle-point avoidance and provide practical convergence guarantees for biased RL algorithms.
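
To see why a horizon that grows only logarithmically in $1/\epsilon$ can suffice, consider the simpler quantity of the discounted return itself: truncating at step $H$ discards a tail of size at most $\gamma^H \mathcal{R}_{max}/(1-\gamma)$. The sketch below computes the smallest $H$ that pushes this tail under a tolerance. It is an illustration only: the names `tail_bound` and `min_horizon` are hypothetical, and the bound is on the truncated return, not the paper's bias bound on the gradient estimator.

```python
import math


def tail_bound(gamma: float, r_max: float, h: int) -> float:
    """Upper bound on |sum_{t >= h} gamma^t r_t| when |r_t| <= r_max."""
    return gamma ** h * r_max / (1.0 - gamma)


def min_horizon(gamma: float, r_max: float, tol: float) -> int:
    """Smallest integer H >= 0 with tail_bound(gamma, r_max, H) <= tol."""
    if not (0.0 < gamma < 1.0) or r_max <= 0.0 or tol <= 0.0:
        raise ValueError("need 0 < gamma < 1, r_max > 0, tol > 0")
    # Closed form: H >= log(r_max / ((1 - gamma) * tol)) / log(1 / gamma).
    guess = math.log(r_max / ((1.0 - gamma) * tol)) / math.log(1.0 / gamma)
    h = max(0, math.ceil(guess))
    # Correct for floating-point rounding in the closed form.
    while h > 0 and tail_bound(gamma, r_max, h - 1) <= tol:
        h -= 1
    while tail_bound(gamma, r_max, h) > tol:
        h += 1
    return h


if __name__ == "__main__":
    for tol in (1e-1, 1e-2, 1e-4):
        print(tol, min_horizon(gamma=0.99, r_max=1.0, tol=tol))
```

Each factor-of-ten reduction in the tolerance adds the same number of steps, roughly $\log(10)/\log(1/\gamma)$, which is the logarithmic dependence the horizon choice relies on.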

Abstract

Since the objective functions of reinforcement learning problems are typically highly nonconvex, it is desirable that policy gradient, the most popular algorithm, escapes saddle points and arrives at second-order stationary points. Existing results only consider vanilla policy gradient algorithms with unbiased gradient estimators, but practical implementations under the infinite-horizon discounted reward setting are biased due to finite-horizon sampling. Moreover, actor-critic methods, whose second-order convergence has not yet been established, are also biased due to the critic approximation of the value function. We provide a novel second-order analysis of biased policy gradient methods, including the vanilla gradient estimator computed from Monte-Carlo sampling of trajectories as well as the double-loop actor-critic algorithm, where in the inner loop the critic improves the approximation of the value function via TD(0) learning. Separately, we also establish the convergence of TD(0) on Markov chains irrespective of initial state distribution.
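
As a minimal sketch of the critic's inner loop, the code below runs tabular TD(0) along a single trajectory of a small Markov reward process, started from a fixed state rather than from the stationary distribution, and compares the result with the exact value function $(I - \gamma P)^{-1} r$. The transition matrix, rewards, constant step size, and the function name `td0_tabular` are all assumptions chosen for illustration; the tabular setting and fixed step size are simplifications and are not taken from the paper.

```python
import numpy as np


def td0_tabular(P, r, gamma, alpha, num_steps, start_state, seed=0):
    """Tabular TD(0) on one trajectory of a Markov reward process.

    P: (n, n) row-stochastic transition matrix (fixed policy folded in).
    r: (n,) reward received on leaving each state.
    """
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    cdf = np.cumsum(P, axis=1)       # per-row CDF for inverse sampling
    u = rng.random(num_steps)        # pre-sampled uniforms, one per step
    V = np.zeros(n)
    s = start_state
    for t in range(num_steps):
        # Sample the next state; clamp guards against rounding in the CDF.
        s_next = min(int(np.searchsorted(cdf[s], u[t], side="right")), n - 1)
        td_error = r[s] + gamma * V[s_next] - V[s]
        V[s] += alpha * td_error     # TD(0) update for the visited state
        s = s_next
    return V


# Hypothetical 3-state chain used only for this illustration.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.4, 0.0, 0.6]])
r = np.array([1.0, 0.0, 2.0])
gamma = 0.9

V_exact = np.linalg.solve(np.eye(3) - gamma * P, r)
V_td = td0_tabular(P, r, gamma, alpha=0.005, num_steps=100_000, start_state=0)

if __name__ == "__main__":
    print("exact:", V_exact)
    print("TD(0):", V_td)
```

With a constant step size the iterates hover in a noise ball around the exact values rather than converging exactly; shrinking `alpha` tightens that ball at the cost of more steps.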

Paper Structure (35 sections, 28 theorems, 204 equations, 3 algorithms)


Key Result

Lemma 3.2

(Lemma 3.2 of zhang2020global) The score function $\nabla \log \pi_{\theta}(a | s)$ is $B$-Lipschitz continuous. Moreover, the policy gradient $\nabla J(\theta)$ is Lipschitz continuous such that for all $\theta_1$, $\theta_2$, we have $\| \nabla J(\theta_1) - \nabla J(\theta_2) \| \leq L \| \theta_1 - \theta_2 \|$, where $L = \frac{\mathcal{R}_{max} B}{(1 - \gamma)^2} + \frac{(1 + \gamma) \mathcal{R}_{max} G^2}{(1 - \gamma)^3}$.
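
The constant $L$ grows quickly as $\gamma$ approaches 1, because of the $(1-\gamma)^{-3}$ term. The short sketch below evaluates the formula for a few discount factors. The function name `lipschitz_constant` and the sample values are hypothetical, and $G$ is assumed here to denote a bound on the norm of the score function, which this excerpt does not define.

```python
def lipschitz_constant(r_max: float, B: float, G: float, gamma: float) -> float:
    """Evaluate L = r_max*B/(1-gamma)^2 + (1+gamma)*r_max*G^2/(1-gamma)^3.

    r_max: bound on the reward magnitude.
    B:     Lipschitz constant of the score function.
    G:     assumed bound on the score function norm (not defined in the excerpt).
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    one_minus = 1.0 - gamma
    return r_max * B / one_minus ** 2 + (1.0 + gamma) * r_max * G ** 2 / one_minus ** 3


if __name__ == "__main__":
    for gamma in (0.5, 0.9, 0.99):
        print(gamma, lipschitz_constant(r_max=1.0, B=1.0, G=1.0, gamma=gamma))
```

Since gradient-based step sizes are typically chosen on the order of $1/L$, this scaling gives a rough sense of how the discount factor constrains the step size.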

Theorems & Definitions (45)

  • Definition 2.1
  • Lemma 3.2
  • Lemma 3.3
  • Theorem 3.6
  • Theorem 3.7
  • Theorem 4.7
  • Theorem 4.8
  • Lemma 4.10
  • Theorem 4.11
  • Lemma 2.1
  • ...and 35 more