AFU: Actor-Free critic Updates in off-policy RL for continuous control

Nicolas Perrin-Gilbert

TL;DR

AFU is an off-policy deep RL algorithm that addresses the challenging "max-Q problem" in Q-learning for continuous action spaces in a new way, using regression and conditional gradient scaling. The authors present it as the first model-free off-policy algorithm competitive with state-of-the-art actor-critic methods while departing from the actor-critic perspective.

Abstract

This paper presents AFU, an off-policy deep RL algorithm addressing in a new way the challenging "max-Q problem" in Q-learning for continuous action spaces, with a solution based on regression and conditional gradient scaling. AFU has an actor but its critic updates are entirely independent from it. As a consequence, the actor can be chosen freely. In the initial version, AFU-alpha, we employ the same stochastic actor as in Soft Actor-Critic (SAC), but we then study a simple failure mode of SAC and show how AFU can be modified to make actor updates less likely to become trapped in local optima, resulting in a second version of the algorithm, AFU-beta. Experimental results demonstrate the sample efficiency of both versions of AFU, marking it as the first model-free off-policy algorithm competitive with state-of-the-art actor-critic methods while departing from the actor-critic perspective.

Paper Structure (19 sections, 20 equations, 14 figures, 1 table, 1 algorithm)

Figures (14)

  • Figure 1: $Q_{toy}(s,a) = \sin(4s) + 0.7\cos(4a)$ for $(s, a) \in [-1, 1]^2$. Our method and IQL both train $V_{\varphi}(s)$ to approximate $s \mapsto \max_{a \in A} Q_{toy}(s, a)$, i.e. to solve a max-Q problem. Each training run uses 3000 gradient descent steps on batches of 256 values of $(s, a)$ drawn uniformly at random.
  • Figure 2: Experimental evaluation of AFU-alpha on a benchmark of 7 MuJoCo tasks.
  • Figure 3: In orange: the reward function $R_{SFM}$ of the SFM environment. Since all transitions are terminal, $R_{SFM}$ coincides with the optimal Q-function. In blue: the critic ($Q_{SAC}$) obtained after training for 20,000 steps with SAC (Haarnoja et al., 2018).
  • Figure 4: Trainings of SAC and AFU-beta in the SFM environment. Plots show results averaged over 10 runs with different random seeds, and shaded areas range from the 25th to the 75th percentile.
  • Figure 5: The gradient $v$ at $a_s$ (on the left) points away from $\mu_\zeta(s)$, which determines the direction toward the vicinity of the argmax of $Q_\psi(s, \cdot)$, so we modify $v$ to get $\mathcal{G}^{s, a_s}\bigl(v\bigr)$ by projecting it on the hyperplane orthogonal to $\mu_\zeta(s) - a_s$. The gradient $v'$ at $a'_s$ (on the right) points in the direction (half-space) of $\mu_\zeta(s)$, so we do not modify it, and $\mathcal{G}^{s, a'_s}\bigl(v'\bigr) = v'$.
  • ...and 9 more figures
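
The max-Q target in Figure 1 can be checked numerically. The sketch below is not the AFU or IQL training procedure; it only computes the ground truth that $V_{\varphi}$ is meant to approximate. Since $\cos(4a)$ reaches its maximum of 1 at $a = 0$, which lies in $[-1, 1]$, the target is $\sin(4s) + 0.7$. The function names, the grid resolution, and the seed are illustrative assumptions.

```python
import numpy as np


def q_toy(s, a):
    """Toy Q-function from the Figure 1 caption."""
    return np.sin(4.0 * s) + 0.7 * np.cos(4.0 * a)


def max_q_closed_form(s):
    """max over a in [-1, 1]: cos(4a) peaks at a = 0, so the max is sin(4s) + 0.7."""
    return np.sin(4.0 * np.asarray(s, dtype=float)) + 0.7


def max_q_grid(s, num_actions=2001):
    """Brute-force check: maximise over a dense action grid (resolution is an assumption)."""
    actions = np.linspace(-1.0, 1.0, num_actions)
    s = np.asarray(s, dtype=float)
    return q_toy(s[..., None], actions).max(axis=-1)


def sample_batch(rng, batch_size=256):
    """One batch of (s, a) pairs drawn uniformly from [-1, 1]^2, as in the caption."""
    batch = rng.uniform(-1.0, 1.0, size=(batch_size, 2))
    return batch[:, 0], batch[:, 1]


rng = np.random.default_rng(0)  # seed chosen arbitrarily for the illustration
s_batch, a_batch = sample_batch(rng)
# Every sampled Q-value is bounded above by the max-Q target at the same state.
gap = max_q_closed_form(s_batch) - q_toy(s_batch, a_batch)
```

The quantity `gap` is non-negative for every sample, which is the property a learned $V_{\varphi}$ has to recover from sampled $(s, a)$ pairs alone.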
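
The gradient modification described in the Figure 5 caption can be sketched as follows. This is a reading of the caption only, not the paper's implementation: the name `project_gradient`, the handling of the degenerate case $a_s = \mu_\zeta(s)$, and the treatment of a zero dot product are assumptions.

```python
import numpy as np


def project_gradient(v, a_s, mu_s, eps=1e-12):
    """Sketch of G^{s, a_s}(v) as described in the Figure 5 caption.

    If v points away from mu_s (negative dot product with mu_s - a_s),
    project v onto the hyperplane orthogonal to mu_s - a_s.
    Otherwise return v unchanged.
    """
    v = np.asarray(v, dtype=float)
    d = np.asarray(mu_s, dtype=float) - np.asarray(a_s, dtype=float)
    d_norm_sq = float(d @ d)
    if d_norm_sq < eps:
        # Assumption: with a_s (almost) equal to mu_s there is no direction to test.
        return v.copy()
    dot = float(v @ d)
    if dot >= 0.0:
        # v already lies in the half-space of mu_s: leave it unmodified.
        return v.copy()
    # Remove the component of v along d, keeping only the orthogonal part.
    return v - (dot / d_norm_sq) * d


a_s = np.array([0.0, 0.0])
mu_s = np.array([1.0, 0.0])
away = project_gradient(np.array([-1.0, 1.0]), a_s, mu_s)    # projected
toward = project_gradient(np.array([1.0, 1.0]), a_s, mu_s)   # unchanged
```

In the usage lines, the first gradient has a negative component along $\mu_\zeta(s) - a_s$, so that component is removed; the second already points into the half-space of $\mu_\zeta(s)$ and is returned as is.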