Non-Asymptotic Analysis for Single-Loop (Natural) Actor-Critic with Compatible Function Approximation

Yudan Wang, Yue Wang, Yi Zhou, Shaofeng Zou

TL;DR

This work provides the tightest non-asymptotic convergence bounds for single-loop AC and NAC with compatible function approximation, a setting in which the critic is time-varying and policy-dependent and all samples come from a single Markovian trajectory. By leveraging ω-dependent yet linear compatible features and a $k$-step TD critic, the authors eliminate the non-diminishing critic approximation error $\varepsilon_{\text{critic}}$ from the error bounds while preserving the best known sample complexities: $O(ε^{-2})$ for AC (up to a log factor) and $O(ε^{-3})$ for NAC, with neighborhood terms reduced to $ε$ and $ε+\sqrt{\varepsilon_{\text{actor}}}$ respectively. The analysis introduces a novel tracking-error framework that bounds the critic's bias as a function of the policy gradient (for AC) or the optimality gap (for NAC), and it handles non-ergodicity and time-varying features without decoupling the actor and critic updates. The results give single-loop AC/NAC with compatible function approximation a theoretical grounding, and the appendix provides supporting proofs and experiments. The findings are relevant to sample-efficient RL with reduced computational overhead, particularly for natural gradient variants that avoid Fisher information inversion.
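
The $k$-step TD critic mentioned above bootstraps after $k$ observed rewards instead of one. The helper below is a minimal sketch of a generic $k$-step target, assuming a discounted setting with discount factor `gamma`; the function name and arguments are hypothetical and are not taken from the paper.

```python
def k_step_td_target(rewards, gamma, bootstrap_value):
    """Return sum_{i<k} gamma**i * rewards[i] + gamma**k * bootstrap_value,
    where k = len(rewards) and bootstrap_value is the critic's estimate
    at the state-action pair reached after k steps."""
    target = bootstrap_value
    # Fold the rewards in from the last one to the first one.
    for r in reversed(rewards):
        target = r + gamma * target
    return target
```

With `k = 1` this reduces to the usual one-step TD target `r + gamma * bootstrap_value`.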

Abstract

Actor-critic (AC) is a powerful method for learning an optimal policy in reinforcement learning, where the critic uses algorithms, e.g., temporal difference (TD) learning with function approximation, to evaluate the current policy and the actor updates the policy along an approximate gradient direction using information from the critic. This paper provides the tightest non-asymptotic convergence bounds for both the AC and natural AC (NAC) algorithms. Specifically, existing studies show that AC converges to an $ε+\varepsilon_{\text{critic}}$ neighborhood of stationary points with the best known sample complexity of $\mathcal{O}(ε^{-2})$ (up to a log factor), and NAC converges to an $ε+\varepsilon_{\text{critic}}+\sqrt{\varepsilon_{\text{actor}}}$ neighborhood of the global optimum with the best known sample complexity of $\mathcal{O}(ε^{-3})$, where $\varepsilon_{\text{critic}}$ is the approximation error of the critic and $\varepsilon_{\text{actor}}$ is the approximation error induced by the insufficient expressive power of the parameterized policy class. This paper analyzes the convergence of both AC and NAC algorithms with compatible function approximation. Our analysis eliminates the term $\varepsilon_{\text{critic}}$ from the error bounds while still achieving the best known sample complexities. Moreover, we focus on the challenging single-loop setting with a single Markovian sample trajectory. Our major technical novelty lies in analyzing the stochastic bias due to policy-dependent and time-varying compatible function approximation in the critic, and handling the non-ergodicity of the MDP due to the single Markovian sample trajectory. Numerical results are also provided in the appendix.
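
To make the single-loop setting concrete, the sketch below runs one critic step and one actor step per sample along a single trajectory, with compatible features recomputed from the current policy at every step. It is an illustration under stated assumptions, not the paper's algorithm: the two-state toy MDP, the tabular softmax policy, the one-step TD critic, the step sizes, and all function names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def compatible_feature(omega, s, a):
    """phi_omega(s, a) = grad_omega log pi_omega(a | s) for a tabular softmax
    policy, flattened to a vector of length n_states * n_actions."""
    n_actions = len(omega[0])
    pi_s = softmax(omega[s])
    phi = [0.0] * (len(omega) * n_actions)
    for b in range(n_actions):
        phi[s * n_actions + b] = (1.0 if b == a else 0.0) - pi_s[b]
    return phi


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def single_loop_ac(n_steps=2000, alpha=0.05, beta=0.1, gamma=0.9, seed=0):
    """One critic step and one actor step per sample on a single trajectory."""
    rng = random.Random(seed)
    n_states, n_actions = 2, 2  # hypothetical toy MDP
    omega = [[0.0] * n_actions for _ in range(n_states)]  # actor parameters
    theta = [0.0] * (n_states * n_actions)                # critic parameters
    s = 0
    a = rng.choices(range(n_actions), weights=softmax(omega[s]))[0]
    for _ in range(n_steps):
        r = 1.0 if a == 1 else 0.0         # assumed reward: action 1 pays 1
        s_next = rng.randrange(n_states)   # assumed dynamics: uniform next state
        a_next = rng.choices(range(n_actions), weights=softmax(omega[s_next]))[0]
        # Features depend on the current omega, so they move with the policy.
        phi = compatible_feature(omega, s, a)
        phi_next = compatible_feature(omega, s_next, a_next)
        # Critic: one-step TD update of the linear estimate phi^T theta.
        delta = r + gamma * dot(phi_next, theta) - dot(phi, theta)
        theta = [th + beta * delta * p for th, p in zip(theta, phi)]
        # Actor: stochastic step along phi * (phi^T theta), same sample.
        q_hat = dot(phi, theta)
        for b in range(n_actions):
            omega[s][b] += alpha * q_hat * phi[s * n_actions + b]
        s, a = s_next, a_next
    return omega, theta
```

Two properties are visible in the sketch: the features average to zero under the policy at each state, and the critic never gets an inner loop to converge before the actor moves, which is the coupling the analysis has to control.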

Paper Structure

This paper contains 26 sections, 27 theorems, 38 equations, 2 figures, 2 tables, 1 algorithm.

Key Result

Proposition 1

With compatible function approximation, the policy gradient $\nabla J(\pi_\omega)$ can be rewritten as:
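
For reference, the classical form of this identity (Sutton et al., 1999) is sketched below. The notation is assumed and may differ from the paper's: $\phi_\omega$ denotes the compatible feature, $d_{\pi_\omega}$ a state-action visitation distribution under $\pi_\omega$, and $\theta^*_\omega$ the best linear fit to $Q^{\pi_\omega}$. The identity is stated up to a normalization constant that depends on how $d_{\pi_\omega}$ is defined.

```latex
\begin{align}
  \phi_\omega(s,a) &= \nabla_\omega \log \pi_\omega(a \mid s), \\
  \theta^*_\omega &\in \arg\min_{\theta}\;
    \mathbb{E}_{(s,a)\sim d_{\pi_\omega}}
    \Big[\big(Q^{\pi_\omega}(s,a) - \phi_\omega(s,a)^\top \theta\big)^2\Big], \\
  \nabla J(\pi_\omega) &\propto
    \mathbb{E}_{(s,a)\sim d_{\pi_\omega}}
    \big[\phi_\omega(s,a)\,\phi_\omega(s,a)^\top\big]\,\theta^*_\omega .
\end{align}
```

In this form the exact action-value function is replaced by its linear fit without biasing the gradient, which is why the critic's approximation error need not appear in the actor's update direction.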

Figures (2)

  • Figure 1: Vanilla AC with a fixed feature function vs. one-step AC with a compatible feature function vs. 128-step AC with a compatible feature function.
  • Figure 2: Vanilla NAC with a fixed feature function vs. one-step NAC with a compatible feature function vs. 128-step NAC with a compatible feature function.

Theorems & Definitions (50)

  • Proposition 1 (Sutton et al., 1999)
  • Proposition 2 (Peters and Schaal, 2008)
  • Proposition 3
  • Proposition 4
  • Proposition 5
  • Theorem 1
  • Theorem 2
  • Remark 1
  • Proof sketch
  • Lemma 1
  • ...and 40 more