Actor-Accelerated Policy Dual Averaging for Reinforcement Learning in Continuous Action Spaces

Ji Gao; Caleb Ju; Guanghui Lan; Zhaohui Tong

TL;DR

This paper provides a theoretical analysis that quantifies how actor approximation error impacts the convergence of PDA under suitable assumptions, and proposes Actor-accelerated PDA, which uses a learned policy network to approximate the solution of the optimization sub-problems, yielding faster runtimes while maintaining convergence guarantees.

Abstract

Policy Dual Averaging (PDA) offers a principled Policy Mirror Descent (PMD) framework that more naturally admits value function approximation than standard PMD, enabling the use of approximate advantage (or Q-) functions while retaining strong convergence guarantees. However, applying PDA in continuous state and action spaces remains computationally challenging, since action selection involves solving an optimization sub-problem at each decision step. In this paper, we propose actor-accelerated PDA, which uses a learned policy network to approximate the solution of the optimization sub-problems, yielding faster runtimes while maintaining convergence guarantees. We provide a theoretical analysis that quantifies how actor approximation error impacts the convergence of PDA under suitable assumptions. We then evaluate its performance on several benchmarks spanning robotics, control, and operations research. Actor-accelerated PDA achieves superior performance compared to popular on-policy baselines such as Proximal Policy Optimization (PPO). Overall, our results bridge the gap between the theoretical advantages of PDA and its practical deployment in continuous-action problems with function approximation.
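The idea of replacing a per-step optimization with a learned actor can be illustrated with a small sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy objective `psi(s, a) = (a - sin(s))**2 + 0.1 * a**2` stands in for the real sub-problem, plain gradient descent stands in for the sub-problem solver, and a linear least-squares model on hand-picked features stands in for the policy network. The helper names (`solve_subproblem`, `fit_actor`, `actor`) are hypothetical.

```python
import numpy as np

def psi_grad(s, a):
    # Gradient in a of the toy objective psi(s, a) = (a - sin(s))**2 + 0.1 * a**2.
    # This objective is an illustrative assumption, not the paper's sub-problem.
    return 2.0 * (a - np.sin(s)) + 0.2 * a

def solve_subproblem(s, steps=200, lr=0.1):
    # Slow path: run an iterative solver for every queried state.
    a = np.zeros_like(s)
    for _ in range(steps):
        a = a - lr * psi_grad(s, a)
    return a

def features(s):
    # Hand-picked features for a stand-in "actor"; a real actor would be a network.
    return np.stack([np.sin(s), np.cos(s), np.ones_like(s)], axis=-1)

def fit_actor(states, targets):
    # Regress the solver's outputs onto state features (least squares).
    w, *_ = np.linalg.lstsq(features(states), targets, rcond=None)
    return w

def actor(w, s):
    # Fast path: a single forward evaluation instead of an inner optimization loop.
    return features(s) @ w

rng = np.random.default_rng(0)
train_states = rng.uniform(-np.pi, np.pi, size=256)
w = fit_actor(train_states, solve_subproblem(train_states))

# Mean absolute error between the actor output and the solver's optimum
# on held-out states, analogous in spirit to an optimum-tracking check.
test_states = np.linspace(-np.pi, np.pi, 50)
mae = float(np.mean(np.abs(actor(w, test_states) - solve_subproblem(test_states))))
print(f"actor vs. solver MAE: {mae:.2e}")
```

In this toy setting the optimum is exactly linear in the chosen features, so the stand-in actor reproduces the solver almost perfectly. With a real sub-problem and a neural actor, the residual gap is the actor approximation error whose effect on convergence the paper analyzes.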

Paper Structure (28 sections, 3 theorems, 54 equations, 8 figures, 5 tables, 2 algorithms)

Introduction
Preliminaries
Method
Actor-accelerated Policy Dual Averaging
Convergence Analysis
Convergence when $\tilde{\mu}_d \ge 0$
Convergence when $\tilde{\mu}_d < 0$
Implementation
Experiment
Evaluation of Optimum Tracking
Continuous Control Benchmark
Operations Research Benchmark
Analysis
Sensitivity Study for $\sigma_0$ and $\lambda$
Choice for Sum Advantage Update
...and 13 more sections

Key Result

Theorem 3.4

Suppose $\lambda_{k+1} \geq \lambda_k \geq 0$ for all $k \geq 0$. Then under Assumptions 1 and 2, if $\tilde{\mu}_d \ge 0$, the performance gap of the sequence of policies generated by actor-accelerated PDA satisfies: for any $s \in \mathbb{S}$, where $\bar{\beta}_k = \sum_{t=0}^{k-1} \beta_t$, and the cumulative optimization error term is defined as: When $\tilde{\mu}_d>0$

Figures (8)

Figure 1: Visualization of optimum tracking with the actor in the Pendulum-v1 environment. The landscape evolution of the scaled optimization sub-problem $\tilde{\Psi}'$ over epochs 5, 8, and 11, at a fixed $\dot{\theta}=0.2$ is shown in the first three plots. The pink dotted line represents the true optimum of $\tilde{\Psi}'$, and the solid red line represents the output of the actor network. The final plot tracks the mean absolute error (MAE) between the true optimum and the actor output averaged across a range of states and actions at each epoch for an extended training process.
Figure 2: Performance comparison of PDA, PPO, TRPO, and NPG across MuJoCo and Box2D environments. The curves and the shaded areas represent the mean and standard deviation across 100 test evaluations (10 seeds per environment and 10 tests per seed), respectively.
Figure 3: OR-Gym Benchmark for PDA and PPO. Episodic reward distributions of agents after training for 3 million (Newsvendor) and 1 million (PortfolioOpt) environment steps, aggregated over 10 random seeds with $10^3$ random trials per seed. A lower threshold is applied for the Newsvendor environment to maintain readability of the plot due to the existence of extreme negative values for PPO. The training curves are shown in the appendix.
Figure 4: Sensitivity to the exploration noise parameter $\sigma_0$ and step size $\lambda$. The heatmap shows the testing episodic reward averaged over the last 5 epochs. The final plot shows the sensitivity study with constant noise (3 seeds per environment and 10 tests per seed), while the remaining plots use the default decreasing noise (5 seeds per environment and 10 tests per seed).
Figure 5: Performance comparison between the PDA with scaled sum advantage update (labeled as PDA) and an exponential smoothing scheme (labeled with smoothing parameter $\alpha$) across two environments. Results are averaged over 5 random seeds. The curve and shaded region represent the mean and standard deviation of the test results.
...and 3 more figures

Theorems & Definitions (7)

Remark 3.2
Theorem 3.4
Theorem 3.5
Lemma A.1
proof
proof
proof
