Double Actor-Critic with TD Error-Driven Regularization in Reinforcement Learning

Haohui Chen; Zhiyong Chen; Aoxiang Liu; Wentuo Fang

TL;DR

Compared to classical deterministic policy gradient-based algorithms that lack a double actor-critic structure, TDDR provides superior estimation and does not introduce any additional hyperparameters, significantly simplifying the design and implementation process.

Abstract

To obtain better value estimation in reinforcement learning, we propose a novel algorithm based on the double actor-critic framework with temporal difference error-driven regularization, abbreviated as TDDR. TDDR employs double actors, with each actor paired with a critic, thereby fully leveraging the advantages of double critics. Additionally, TDDR introduces an innovative critic regularization architecture. Compared to classical deterministic policy gradient-based algorithms that lack a double actor-critic structure, TDDR provides superior estimation. Moreover, unlike existing algorithms with double actor-critic frameworks, TDDR does not introduce any additional hyperparameters, significantly simplifying the design and implementation process. Experiments demonstrate that TDDR exhibits strong competitiveness compared to benchmark algorithms in challenging continuous control tasks.

Paper Structure (20 sections, 2 theorems, 44 equations, 4 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Related Work
Comparison
Preliminaries
Deterministic Policy Gradient
Double Actor-Critic
Regularization
The TDDR Algorithm
Double Actors with CDQ
Critic Regularization Architecture
Comparative Analysis of Algorithms
Convergence Analysis
Experiments
Comparison with DDPG and TD3
...and 5 more sections

Key Result

Lemma 1

Consider a stochastic process $(\zeta_t, \Delta_t, F_t)$, $t \geq 0$, where $\zeta_t$, $\Delta_t$, and $F_t$ satisfy: for $x_t \in X$ and $t\geq 0$. Let $P_t$ be a sequence of increasing $\sigma$-fields such that $\zeta_0$ and $\Delta_0$ are $P_0$-measurable, and $\zeta_t$, $\Delta_t$ and $F_{t-1}$ are $P_t$-measurable, $t=1, 2, \ldots$. Assume that the following hold: Then $\Delta_t$ converges

Figures (4)

Figure 1: Architecture of TDDR and the benchmark algorithms: $q_{ij} = Q_{\theta_i'}(s', a_j')$ and $q_i = Q_{\theta_i'}(s, a)$; the duplication of $Q_1'/Q_2'$ indicates that the same networks are used with different inputs; the action $a_i'$ is generated by $A_i'$ following \ref{aprime}.
Figure 2: Comparison of TDDR with DDPG and TD3 across nine environments. (a) Ant-v2, (b) HalfCheetah-v2, (c) Hopper-v2, (d) Walker2d-v2, (e) Reacher-v2, (f) InvertedPendulum-v2, (g) InvertedDoublePendulum-v2, (h) BipedalWalker-v3, (i) LunarLanderContinuous-v2.
Figure 3: Comparison of TDDR with DARC, SD3, and GD3 with better hyperparameters across nine environments. (a) Ant-v2, (b) HalfCheetah-v2, (c) Hopper-v2, (d) Walker2d-v2, (e) Reacher-v2, (f) InvertedPendulum-v2, (g) InvertedDoublePendulum-v2, (h) BipedalWalker-v3, (i) LunarLanderContinuous-v2.
Figure 4: Comparison of TDDR with DARC, SD3, and GD3 with worse hyperparameters across three environments. (a) Ant-v2, (b) HalfCheetah-v2, (c) Walker2d-v2.
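
The quantities named in the Figure 1 caption can be made concrete with a small numerical sketch. The code below is only an illustration under stated assumptions, not the paper's update rule: it takes a 2x2 array of target-critic values $q_{ij} = Q_{\theta_i'}(s', a_j')$, forms one clipped double-Q target per target actor by taking the minimum over the two critics, and then picks between the two targets by comparing TD-error magnitudes against $q_i = Q_{\theta_i'}(s, a)$. The function names `clipped_targets` and `select_target`, the pairing of target $j$ with $q_j$, and the "smaller absolute TD error wins" rule are all hypothetical choices made here for illustration; this summary does not state the actual TDDR selection rule.

```python
import numpy as np


def clipped_targets(r, gamma, q_next):
    """Return one clipped double-Q target per target actor.

    q_next[i][j] plays the role of q_ij = Q_{theta_i'}(s', a_j') from the
    Figure 1 caption. For each action a_j', the minimum over the two
    target critics is used, as in clipped double Q-learning.
    """
    q_next = np.asarray(q_next, dtype=float)
    return r + gamma * q_next.min(axis=0)


def select_target(r, gamma, q_next, q_curr):
    """Pick one target using TD-error magnitudes.

    ASSUMPTION (not taken from the source): target j is compared against
    q_curr[j] = Q_{theta_j'}(s, a), and the target whose TD error has the
    smaller absolute value is returned.
    """
    targets = clipped_targets(r, gamma, q_next)
    td_errors = targets - np.asarray(q_curr, dtype=float)
    j = int(np.argmin(np.abs(td_errors)))
    return targets[j], td_errors


# Toy numbers chosen so the arithmetic is exact in floating point.
q_next = [[2.0, 4.0],   # Q_1'(s', a_1'), Q_1'(s', a_2')
          [3.0, 1.0]]   # Q_2'(s', a_1'), Q_2'(s', a_2')
y, deltas = select_target(r=1.0, gamma=0.5, q_next=q_next, q_curr=[1.75, 2.5])
print(y, deltas)
```

With these toy values the two clipped targets are 2.0 and 1.5, the corresponding TD errors are 0.25 and -1.0, and the sketch therefore returns the first target. No extra hyperparameter appears in the selection step, which matches the abstract's claim in spirit only; the exact mechanism should be checked against the paper itself.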

Theorems & Definitions (3)

Lemma 1
Theorem 1
proof
