AFU: Actor-Free critic Updates in off-policy RL for continuous control

Nicolas Perrin-Gilbert

TL;DR

AFU is an off-policy deep RL algorithm that takes a new approach to the challenging "max-Q problem" in Q-learning for continuous action spaces, with a solution based on regression and conditional gradient scaling. It is presented as the first model-free off-policy algorithm competitive with state-of-the-art actor-critic methods while departing from the actor-critic perspective.

Abstract

This paper presents AFU, an off-policy deep RL algorithm addressing in a new way the challenging "max-Q problem" in Q-learning for continuous action spaces, with a solution based on regression and conditional gradient scaling. AFU has an actor but its critic updates are entirely independent from it. As a consequence, the actor can be chosen freely. In the initial version, AFU-alpha, we employ the same stochastic actor as in Soft Actor-Critic (SAC), but we then study a simple failure mode of SAC and show how AFU can be modified to make actor updates less likely to become trapped in local optima, resulting in a second version of the algorithm, AFU-beta. Experimental results demonstrate the sample efficiency of both versions of AFU, marking it as the first model-free off-policy algorithm competitive with state-of-the-art actor-critic methods while departing from the actor-critic perspective.

Paper Structure (19 sections, 20 equations, 14 figures, 1 table, 1 algorithm)

Introduction
Related Work
Preliminaries
A new way to solve the max-Q problem
Method
Experiments
Actor-free critic updates and actor training
AFU-alpha
Experiments
A simple failure mode of SAC
AFU-beta
Conclusion
Hyperparameters
Conditional gradient rescaling seen as adaptive regularization
Constraining the sign of $A_{\xi_i}(s,a)$ in a soft way
...and 4 more sections

Figures (14)

Figure 1: $Q_{toy}(s,a) = \sin(4s) + 0.7\cos(4a)$ for $(s, a) \in [-1, 1]^2$. Our method and IQL both train $V_{\varphi}(s)$ to approximate $s \mapsto \max_{a \in A} (Q_{toy}(s, a))$, i.e. solve a max-Q problem. Trainings are done with 3000 gradient descent steps on batches of 256 uniformly randomly drawn values of $(s, a)$.
Figure 2: Experimental evaluation of AFU-alpha on a benchmark of 7 MuJoCo tasks.
Figure 3: In orange: the reward function $R_{SFM}$ of the SFM environment. Since all transitions are terminal, $R_{SFM}$ coincides with the optimal Q-function. In blue: the critic ($Q_{SAC}$) obtained after training for 20,000 steps with SAC (Haarnoja et al., 2018).
Figure 4: Trainings of SAC and AFU-beta in the SFM environment. Plots show results averaged over 10 runs with different random seeds, and shaded areas range from the 25th to the 75th percentile.
Figure 5: The gradient $v$ at $a_s$ (on the left) points away from $\mu_\zeta(s)$, which determines the direction toward the vicinity of the argmax of $Q_\psi(s, \cdot)$, so we modify $v$ to get $\mathcal{G}^{s, a_s}\bigl(v\bigr)$ by projecting it on the hyperplane orthogonal to $\mu_\zeta(s) - a_s$. The gradient $v'$ at $a'_s$ (on the right) points in the direction (half-space) of $\mu_\zeta(s)$, so we do not modify it, and $\mathcal{G}^{s, a'_s}\bigl(v'\bigr) = v'$.
...and 9 more figures
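
The captions of Figures 1 and 5 describe two small computations that can be sketched in code. The following is a minimal NumPy illustration, not the paper's implementation. The function names (`q_toy`, `max_q_toy`, `project_gradient`) and the brute-force action grid are assumptions made here for illustration. The captions do not say which quantity the gradient $v$ is taken of, so the sketch treats it as a plain vector.

```python
import numpy as np


def q_toy(s, a):
    """Toy Q-function from the Figure 1 caption."""
    return np.sin(4.0 * s) + 0.7 * np.cos(4.0 * a)


def max_q_toy(s, num_actions=2001):
    """Brute-force reference for s -> max_a Q_toy(s, a) with a in [-1, 1].

    This grid search is only a ground-truth check for the toy problem
    (an assumption of this sketch); it is not how V_phi is trained.
    """
    actions = np.linspace(-1.0, 1.0, num_actions)
    s = np.asarray(s, dtype=float)
    # Broadcast states against the action grid, then maximize over actions.
    return q_toy(s[..., None], actions).max(axis=-1)


def project_gradient(v, a_s, mu):
    """Gradient modification described in the Figure 5 caption.

    If v points away from mu (negative dot product with mu - a_s), project
    v onto the hyperplane orthogonal to mu - a_s; otherwise return v as is.
    """
    v = np.asarray(v, dtype=float)
    d = np.asarray(mu, dtype=float) - np.asarray(a_s, dtype=float)
    dd = float(np.dot(d, d))
    vd = float(np.dot(v, d))
    if dd == 0.0 or vd >= 0.0:
        return v
    # Remove the component of v along d, keeping the orthogonal part.
    return v - (vd / dd) * d


if __name__ == "__main__":
    states = np.linspace(-1.0, 1.0, 5)
    print(max_q_toy(states))
    print(project_gradient([-1.0, 1.0], a_s=[0.0, 0.0], mu=[1.0, 0.0]))
    print(project_gradient([1.0, 1.0], a_s=[0.0, 0.0], mu=[1.0, 0.0]))
```

For the toy function, the maximum over $a$ is reached at $a = 0$, so the reference target is $\sin(4s) + 0.7$. In the projection, a vector already in the half-space of $\mu_\zeta(s)$ is returned unchanged, matching the right-hand case of Figure 5.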
