Actor-Critic Reinforcement Learning with Phased Actor

Ruofan Wu, Junmin Zhong, Jennie Si

TL;DR

PAAC addresses high variance in policy-gradient RL for continuous control by introducing a phased actor that blends the state-action value $Q(x_k,\pi(x_k))$ and the TD error $\delta$ in the policy gradient. The paper provides convergence and variance-reduction guarantees and shows that PAAC can be piggybacked onto existing methods such as dHDP and DDPG. Empirically, on the DeepMind Control Suite, PAAC improves total cost, learning variance, robustness, learning speed, and success rate across multiple tasks, and can yield an enhanced version of the basic dHDP when combined with replay and target networks. Overall, the work offers a unified view of these policy-gradient algorithms and presents PAAC as an enhancement that can be integrated into a broad class of actor-critic methods for deterministic control.

Abstract

Policy gradient methods in actor-critic reinforcement learning (RL) have become perhaps the most promising approaches to solving continuous optimal control problems. However, the trial-and-error nature of RL and the inherent randomness associated with solution approximations cause variations in the learned optimal values and policies. This has significantly hindered their successful deployment in real-life applications where control responses need to meet dynamic performance criteria deterministically. Here we propose a novel phased actor in actor-critic (PAAC) method, aiming at improving policy gradient estimation and thus the quality of the control policy. Specifically, PAAC accounts for both the $Q$ value and the TD error in its actor update. We prove qualitative properties of PAAC for learning convergence of the value and policy, solution optimality, and stability of system dynamics. Additionally, we show variance reduction in policy gradient estimation. PAAC performance is systematically and quantitatively evaluated in this study using the DeepMind Control Suite (DMC). Results show that PAAC leads to significant performance improvement measured by total cost, learning variance, robustness, learning speed, and success rate. As PAAC can be piggybacked onto general policy gradient learning frameworks, we select well-known methods such as direct heuristic dynamic programming (dHDP), deep deterministic policy gradient (DDPG), and their variants to demonstrate the effectiveness of PAAC. Consequently, we provide a unified view of these related policy gradient algorithms.
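
To make the idea of an actor update that uses both the $Q$ value and the TD error more concrete, here is a minimal sketch. It is not the paper's actor loss, which this page does not reproduce. The function names, the discount `gamma`, and the convex mixing weight `m` are all assumptions made for illustration, and the TD error uses the common cost-based form $\delta = c + \gamma Q(x_{k+1},\pi(x_{k+1})) - Q(x_k,\pi(x_k))$.

```python
def td_error(cost, q_now, q_next, gamma=0.95):
    """One-step TD error in a cost-minimizing setting (assumed form)."""
    return cost + gamma * q_next - q_now


def blended_actor_signal(q_now, delta, m):
    """Hypothetical convex blend of Q value and TD error.

    m = 0 uses only the Q value, m = 1 uses only the TD error.
    The actual phasing rule in PAAC may differ from this blend.
    """
    if not 0.0 <= m <= 1.0:
        raise ValueError("m must lie in [0, 1]")
    return (1.0 - m) * q_now + m * delta


# Illustrative numbers only, not taken from any experiment.
delta = td_error(cost=1.0, q_now=2.0, q_next=1.0, gamma=0.5)
signal = blended_actor_signal(q_now=2.0, delta=delta, m=0.5)
```

In an actual actor-critic implementation, the blended quantity would be differentiated with respect to the policy parameters. The scalar version above only shows how the two terms could be combined.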

Paper Structure

This paper contains 16 sections, 4 theorems, 56 equations, 4 figures, 2 tables, and 1 algorithm.

Key Result

Theorem 1

Consider the state-action value $Q(x_k,u_k)$ as in the PAAC HJB equation, and the control policy $\pi(x_k|\theta)$, the parameters of which are updated based on the policy gradient estimator in the actor loss equation. Then we have the following

Figures (4)

  • Figure 1: A unified actor-critic framework: illustration of different DRL algorithms and how they relate to one another. Inside the white bubble is the vanilla dHDP as in [si2001online]; the yellow bubble forms DDPG [lillicrap2015continuous]. The green box shows how the phased actor is realized; it can be piggybacked onto general actor-critic structures such as vanilla dHDP and DDPG, as shown. The dashed lines show the sample collection path, and the black lines show how learning proceeds.
  • Figure 2: Learning curves of averaged total cost for the benchmark study. Each learning curve is averaged over 10 different random seeds and shaded by its 95% confidence interval.
  • Figure 3: Learning curves of averaged total cost for different PAAC switching methods. Each learning curve is averaged over 10 different random seeds and shaded by its 95% confidence interval.
  • Figure 4: Learning curves of averaged total cost for the ablation study. Each learning curve is averaged over 10 different random seeds and shaded by its 95% confidence interval.
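
The captions above describe curves averaged over random seeds with a 95% confidence band. A minimal sketch of how such a band could be computed is given below. It assumes a normal approximation (z = 1.96) on the standard error of the mean; the page does not say which interval construction the authors used, and the function names and sample data are hypothetical.

```python
import math
import statistics


def mean_ci95(values, z=1.96):
    """Mean and normal-approximation 95% interval for one time step."""
    n = len(values)
    mean = statistics.fmean(values)
    # Sample standard deviation (n - 1 denominator) over seeds.
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, mean - z * sem, mean + z * sem


def curve_with_ci(runs):
    """runs: one list of per-step total costs per seed, all of equal length.

    Returns a list of (mean, lower, upper) tuples, one per step.
    """
    return [mean_ci95(step_values) for step_values in zip(*runs)]


# Two made-up seeds with two evaluation steps each.
example_runs = [[1.0, 2.0], [3.0, 2.0]]
band = curve_with_ci(example_runs)
```

With only 10 seeds, a Student-t critical value (about 2.26 for 9 degrees of freedom) gives a slightly wider band than z = 1.96. Either choice can be passed through the `z` argument.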

Theorems & Definitions (6)

  • Remark 1
  • Theorem 1
  • Remark 2
  • Theorem 2
  • Theorem 3
  • Theorem 4