Reusing Historical Trajectories in Natural Policy Gradient via Importance Sampling: Convergence and Convergence Rate

Yifan Lin; Yuhao Wang; Enlu Zhou

TL;DR

The authors rigorously analyze the convergence of a natural policy gradient method that reuses historical trajectories via importance sampling, and show that reusing past data improves the convergence rate while maintaining theoretical guarantees.

Abstract

Reinforcement learning provides a mathematical framework for learning-based control, whose success largely depends on the amount of data it can utilize. The efficient utilization of historical trajectories obtained from previous policies is essential for expediting policy optimization. Empirical evidence has shown that policy gradient methods based on importance sampling work well. However, the existing literature often neglects the interdependence between trajectories from different iterations, and the good empirical performance lacks a rigorous theoretical justification. In this paper, we study a variant of the natural policy gradient method that reuses historical trajectories via importance sampling. We show that the bias of the proposed gradient estimator is asymptotically negligible, that the resulting algorithm is convergent, and that reusing past trajectories improves the convergence rate. We further apply the proposed estimator to popular policy optimization algorithms such as trust region policy optimization. Our theoretical results are verified on classical benchmarks.
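
To make the idea of reusing trajectories concrete, the following is a minimal sketch rather than the paper's algorithm. It assumes a hypothetical one-step problem with a scalar Gaussian policy $a \sim N(\theta, 1)$ and a toy reward $-(a-2)^2$. Samples from the last few iterations are reweighted by the likelihood ratio between the current policy and the policy that generated them, and the averaged gradient is scaled by a regularized Fisher estimate $(\epsilon + \hat{F})^{-1}$. All function names, the reward, and the hyperparameters are illustrative assumptions.

```python
import math
import random


def log_prob(a, theta):
    # Log-density of the assumed policy N(theta, 1) at action a.
    return -0.5 * (a - theta) ** 2 - 0.5 * math.log(2 * math.pi)


def score(a, theta):
    # Gradient of log_prob with respect to theta.
    return a - theta


def reward(a):
    # Toy reward (an assumption for illustration); maximized at a = 2.
    return -(a - 2.0) ** 2


def reuse_natural_gradient(theta, history, eps=1e-3):
    """Importance-weighted natural gradient estimate.

    history is a list of (behavior_theta, actions) batches, newest last.
    """
    grad_sum, count = 0.0, 0
    for behavior_theta, actions in history:
        for a in actions:
            # Likelihood ratio of the current policy to the behavior policy.
            w = math.exp(log_prob(a, theta) - log_prob(a, behavior_theta))
            grad_sum += w * score(a, theta) * reward(a)
            count += 1
    grad = grad_sum / count
    # Scalar Fisher information estimated from the newest batch only.
    _, newest = history[-1]
    fisher = sum(score(a, theta) ** 2 for a in newest) / len(newest)
    # Regularized inverse, (eps + F)^{-1}, applied to the gradient.
    return grad / (eps + fisher)


def run(iterations=200, batch=50, reuse=5, step=0.05, seed=0):
    rng = random.Random(seed)
    theta, history = 0.0, []
    for _ in range(iterations):
        actions = [rng.gauss(theta, 1.0) for _ in range(batch)]
        history.append((theta, actions))
        history = history[-reuse:]  # keep only the last `reuse` batches
        theta += step * reuse_natural_gradient(theta, history)
    return theta


if __name__ == "__main__":
    print(run())
```

In this sketch the newest batch always has weight 1, because it was drawn from the current policy. Older batches are reused without new sampling, at the cost of the dependence between iterations that the abstract highlights.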

Paper Structure (34 sections, 15 theorems, 90 equations, 9 figures)

This paper contains 34 sections, 15 theorems, 90 equations, 9 figures.

Introduction
Problem Formulation and Algorithm Design
Preliminaries: Markov Decision Process
Preliminaries: Natural Policy Gradient
Natural Policy Gradient with Reusing Historical Trajectories
Bias and Variance of the Gradient Estimator Reusing Samples from Previous Iteration
Summary of Main Theoretical Results
Convergence Analysis
Regularity Conditions for RNPG
Asymptotic Convergence by the ODE Method
Approximation and Extension
Characterization of Asymptotic Convergence Rate
Numerical Experiments
Experiment Setting and Benchmarks
Experiment I: Convergence Rate on Cartpole and Inverted Pendulum
...and 19 more sections

Key Result

Theorem 1

Let $\mathcal{D}^d[0,\infty)$ be the space of $\mathbb{R}^d$-valued functions that are right continuous and have left-hand limits in each dimension. Under Assumption 1, there exists a process $\theta^*(t)$ to which some subsequence of $\{\theta^n(t)\}$ converges w.p.1 in the space $\mathcal{D}^d[0,\infty)$ … where $\bar{F}^{-1}(\theta) = \mathbb{E}\big[\big(\epsilon I_d + \frac{1}{B}\sum_{i=1}^B S(\xi_i, \ldots$

Figures (9)

Figure 1: Diagram of the cartpole and inverted pendulum tasks.
Figure 2: Mean (Figure 2(a)) and standard error (Figure 2(b)) of the reward over $n=150$ iterations for VPG, RPG, VNPG, and RNPG run on cartpole.
Figure 3: Mean (Figure 3(a)) and standard error (Figure 3(b)) of the reward over $n=500$ iterations for VPG, RPG, VNPG, and RNPG run on inverted pendulum.
Figure 4: Mean (Figure 4(a)) and standard error (Figure 4(b)) of the reward over $n=150$ iterations for RNPG under reuse sizes $K=1,10,50,100$ run on cartpole.
Figure 5: Time (s) running RNPG over $n=100$ iterations under different reuse sizes run on cartpole.
...and 4 more figures

Theorems & Definitions (16)

Theorem 1
Definition 1: Zero asymptotic rate of change
Lemma 1
Lemma 2
Lemma 3
Corollary 1
Corollary 2
Theorem 2
Lemma 4
Lemma 5
...and 6 more
