Improved Sample Complexity Analysis of Natural Policy Gradient Algorithm with General Parameterization for Infinite Horizon Discounted Reward Markov Decision Processes

Washim Uddin Mondal; Vaneet Aggarwal

TL;DR

This work introduces the Accelerated Natural Policy Gradient (ANPG) algorithm for infinite-horizon discounted MDPs with general policy parameterization. By replacing the inner SGD with an accelerated stochastic gradient descent procedure and providing a refined global convergence analysis, the authors establish a sample complexity of $\mathcal{O}(\epsilon^{-2})$ and an iteration complexity of $\mathcal{O}(\epsilon^{-1})$, while remaining importance-sampling-free (IS-free) and Hessian-free. The key insight is to interpret the first-order estimation error as the trajectory of a noiseless accelerated gradient method, which makes it decay faster and yields the $\mathcal{O}(\epsilon^{-2})$ sample bound, a $\log(1/\epsilon)$ improvement over prior results. The results advance the practical viability of first-order, Hessian-free methods for policy optimization in large-scale RL, with potential extensions to constrained and utility-based settings.

Abstract

We consider the problem of designing sample-efficient learning algorithms for infinite-horizon discounted-reward Markov Decision Processes. Specifically, we propose the Accelerated Natural Policy Gradient (ANPG) algorithm, which uses an accelerated stochastic gradient descent process to obtain the natural policy gradient. ANPG achieves $\mathcal{O}(\epsilon^{-2})$ sample complexity and $\mathcal{O}(\epsilon^{-1})$ iteration complexity with general parameterization, where $\epsilon$ defines the optimality error. This improves the state-of-the-art sample complexity by a $\log(\frac{1}{\epsilon})$ factor. ANPG is a first-order algorithm and, unlike some existing literature, does not require the unverifiable assumption that the variance of importance sampling (IS) weights is upper bounded. In the class of Hessian-free and IS-free algorithms, ANPG beats the best-known sample complexity by a factor of $\mathcal{O}(\epsilon^{-\frac{1}{2}})$ and simultaneously matches their state-of-the-art iteration complexity.
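As a rough illustration of the idea of obtaining the natural policy gradient through an accelerated stochastic gradient procedure, the sketch below approximates a direction $\omega^* = F^{-1} g$ by minimizing the quadratic $\frac{1}{2}\omega^\top F \omega - g^\top \omega$ with momentum and tail averaging. It is not the ANPG update itself: plain Nesterov-style momentum is substituted for the paper's accelerated SGD procedure, and the names `fisher`, `grad_j`, and `accelerated_npg_direction`, the synthetic Gaussian noise, and all parameter values are assumptions made for this example.

```python
import random


def matvec(matrix, vector):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def noisy_gradient(fisher, grad_j, omega, rng, noise_std):
    """Gradient of 0.5 * w^T F w - g^T w at omega, plus synthetic noise.

    The Gaussian noise is an assumed stand-in for sampling error; it is
    not the gradient estimator used in the paper.
    """
    exact = [fw - g for fw, g in zip(matvec(fisher, omega), grad_j)]
    return [e + rng.gauss(0.0, noise_std) for e in exact]


def accelerated_npg_direction(fisher, grad_j, steps, step_size, momentum,
                              noise_std=0.0, seed=0):
    """Approximate F^{-1} g with momentum SGD and tail averaging.

    Hypothetical helper: Nesterov-style momentum replaces the paper's
    accelerated SGD procedure purely for illustration.
    """
    rng = random.Random(seed)
    dim = len(grad_j)
    omega = [0.0] * dim
    previous = [0.0] * dim
    tail_start = steps // 2          # average only the second half
    tail_sum = [0.0] * dim
    for t in range(steps):
        # Evaluate the gradient at an extrapolated (look-ahead) point.
        lookahead = [w + momentum * (w - p) for w, p in zip(omega, previous)]
        grad = noisy_gradient(fisher, grad_j, lookahead, rng, noise_std)
        previous = omega
        omega = [y - step_size * g for y, g in zip(lookahead, grad)]
        if t >= tail_start:
            tail_sum = [s + w for s, w in zip(tail_sum, omega)]
    return [s / (steps - tail_start) for s in tail_sum]
```

With the toy inputs `fisher = [[2.0, 0.0], [0.0, 1.0]]` and `grad_j = [2.0, 3.0]`, the exact direction is `[1.0, 3.0]`, and a noiseless run with `step_size=0.5` and `momentum=0.17` approaches it. A policy update would then move the parameters along the returned direction, scaled by a learning rate.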

Paper Structure (17 sections, 10 theorems, 55 equations, 1 table, 1 algorithm)

Introduction
Our Contributions and Challenges
Related Works
Problem Formulation
Algorithm
Sample Complexity Analysis
Outer Loop Analysis
Inner Loop Analysis
Final Result
Conclusion
Proof of Lemma \ref{lemma:unbiased_estimate}
Proof of Lemma \ref{lemma:local_global}
Proof of Lemma \ref{lemma_gradient_bound}
Proof of Lemma \ref{lemma_noise_variance}
Proofs of Lemmas \ref{lemma_second_order} and \ref{lemma_first_order}
...and 2 more sections

Key Result

Lemma 1

If $\hat{\nabla}_\omega L_{\nu^{\pi_\theta}_\rho}(\omega, \theta)$ denotes the gradient estimate yielded by Algorithm \ref{algo_sampling}, then the following holds.

Theorems & Definitions (12)

Lemma 1
Lemma 2
proof
Lemma 3
proof
Lemma 4
Corollary 1
Lemma 5
Lemma 6
Lemma 7
...and 2 more
