VA-learning as a more efficient alternative to Q-learning

Yunhao Tang; Rémi Munos; Mark Rowland; Michal Valko

TL;DR

This work introduces VA-learning, which directly learns the advantage function and the value function via bootstrapping, without explicit reference to Q-functions, and improves sample efficiency over Q-learning in both tabular implementations and deep RL agents on Atari-57 games.

Abstract

In reinforcement learning, the advantage function is critical for policy improvement, but it is often extracted from a learned Q-function. A natural question is: why not learn the advantage function directly? In this work, we introduce VA-learning, which directly learns the advantage function and the value function via bootstrapping, without explicit reference to Q-functions. VA-learning learns off-policy and enjoys theoretical guarantees similar to those of Q-learning. Thanks to the direct learning of the advantage function and the value function, VA-learning improves sample efficiency over Q-learning in both tabular implementations and deep RL agents on Atari-57 games. We also identify a close connection between VA-learning and the dueling architecture, which partially explains why a simple architectural change to DQN agents tends to improve performance.

Paper Structure (49 sections, 5 theorems, 58 equations, 7 figures, 1 table, 3 algorithms)

Introduction
Theoretical guarantee as Q-learning.
Improved efficiency of tabular algorithms.
Large-scale value-based learning.
Background
VA-learning
Policy evaluation and control
Understanding the back-up targets.
Control case.
Why VA-learning can be more efficient
An illustrative example.
Can VA-learning underperform Q-learning?
Convergence of VA-learning
Convergence of VA-learning from VA recursion.
VA-learning with function approximation
...and 34 more sections

Key Result

Theorem 1

(Convergence of VA recursion) For the policy evaluation case, define $V_\mu^\pi(x)\coloneqq \sum_a \mu(a|x)Q^\pi(x,a)$ and $A_\mu^\pi(x,a)\coloneqq Q^\pi(x,a)-V_\mu^\pi(x)$. Let $C_\mu^\pi=\left\lVert V_0 - V_\mu^\pi \right\rVert_\infty + \left\lVert A_0 - A_\mu^\pi \right\rVert_\infty$ be the initi which also implies $\left\lVert Q_t-Q^\pi\right\rVert_\infty=\mathcal{O}(\gamma^t)$. For the contro
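The definitions in the theorem statement can be checked numerically. The following is a minimal pure-Python sketch, assuming a small hypothetical table of $Q^\pi$ values and a hypothetical behavior policy $\mu$; the helper names `decompose` and `initial_error` and all numbers are illustrative and do not come from the paper. It computes $V_\mu^\pi$, $A_\mu^\pi$ and the constant $C_\mu^\pi$ for a zero initialization.

```python
def decompose(q, mu):
    """Split Q into (V, A): V(x) = sum_a mu(a|x) Q(x,a) and A(x,a) = Q(x,a) - V(x)."""
    v = [sum(m * qa for m, qa in zip(mu_x, q_x)) for q_x, mu_x in zip(q, mu)]
    a = [[qa - v_x for qa in q_x] for q_x, v_x in zip(q, v)]
    return v, a


def initial_error(v0, a0, v, a):
    """C = ||V0 - V||_inf + ||A0 - A||_inf, following the theorem statement."""
    v_err = max(abs(x - y) for x, y in zip(v0, v))
    a_err = max(abs(x - y) for r0, r in zip(a0, a) for x, y in zip(r0, r))
    return v_err + a_err


# Hypothetical example with 2 states and 2 actions (numbers are made up).
q_pi = [[1.0, 3.0], [0.0, 2.0]]
mu = [[0.5, 0.5], [0.25, 0.75]]

v_mu, a_mu = decompose(q_pi, mu)

# By construction the advantage averages to zero under mu in every state.
mean_adv = [sum(m * adv for m, adv in zip(mu_x, a_x)) for mu_x, a_x in zip(mu, a_mu)]

# Error constant for an all-zero initialization V0 = 0, A0 = 0.
c = initial_error([0.0, 0.0], [[0.0, 0.0], [0.0, 0.0]], v_mu, a_mu)
```

Because $A_\mu^\pi$ is defined by subtracting the $\mu$-weighted mean of $Q^\pi$, its $\mu$-weighted average is zero in every state, which is what `mean_adv` checks.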

Figures (7)

Figure 1: A simple scenario to illustrate the effectiveness of VA-learning over Q-learning. There are two states $x,y$, and from state $y$ there are two actions $a,b$. Given a back-up target $\widehat{Q}(y,a)$, VA-learning updates the prediction for both $Q(y,a)$ and $Q(y,b)$ thanks to the shared value function $V(y)$. In contrast, Q-learning only updates the prediction $Q(y,a)$. The accelerated learning of $Q(y,b)$ in turn accelerates learning of $Q(x,\cdot)$ when bootstrapping from $Q(y,\cdot)$.
Figure 2: Comparing VA-learning (Section \ref{sec:VA-learning}), Q-learning with behavior dueling (Section \ref{sec:b-dueling}), Q-learning with uniform dueling [wang2016sample] and regular Q-learning. We experiment on tabular MDPs with a fixed behavior policy $\mu=\epsilon u + (1-\epsilon)\pi_\text{det}$, where $\pi_\text{det}$ is a randomly sampled and then fixed deterministic policy, $u$ is the uniform policy, and $\epsilon=0.8$. Performance is measured by evaluating the greedy policy with respect to the learned Q-function. The new algorithmic variants significantly outperform prior methods.
Figure 3: Comparing different algorithmic variants in tabular MDPs with a fixed behavior policy $\mu=\epsilon u + (1-\epsilon)\pi_\text{det}$, where $\pi_\text{det}$ is a randomly sampled and then fixed deterministic policy, $u$ is the uniform policy, and $\epsilon$ varies ($x$-axis). As $\epsilon\rightarrow 1$ and $\mu$ approaches uniform, Q-learning with the dueling architecture catches up in performance with behavior dueling and VA-learning.
Figure 4: Comparing algorithmic variants implemented with the full Atari action set. VA-learning and behavior dueling are significantly better than the uniform dueling architecture, which in turn improves over the $n$-step Q-learning baseline. Compared to the standard Atari setup in Figure \ref{fig:epsilon-dqn}(b), the performance of VA-learning and behavior dueling does not degrade.
Figure 5: (a) Comparing VA-learning (Section \ref{sec:VA-learning}) with Q-learning for tabular policy evaluation. We evaluate a target policy $\pi$ formed as a convex combination of a deterministic policy and the uniform policy, using a fixed number of trajectories collected under the uniform policy. The $y$-axis shows the approximation error to the advantage $\left\lVert \widehat{A}_t-A^\pi\right\rVert_2$ at each iteration $t$. Given any data budget, VA-learning obtains more accurate approximations to the advantage function than Q-learning. (b) The same setup as in (a). The $y$-axis shows the approximation error to the Q-function $\left\lVert \widehat{Q}_t-Q^\pi\right\rVert_2$ at each iteration $t$. Given any data budget, VA-learning approximates the Q-function at a slightly faster rate than Q-learning.
...and 2 more figures
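
Two ingredients from the figure captions can be sketched in a few lines of Python: the mixture behavior policy $\mu=\epsilon u + (1-\epsilon)\pi_\text{det}$ from Figures 2 and 3, and the shared-value effect described in Figure 1. The update in `va_update` is an assumption made purely for illustration (a step of size `alpha` that moves $V(y)$ toward the target and $A(y,a)$ toward the target minus $V(y)$); the paper's exact rule is given in its VA-learning section, which is not reproduced here. All function names and numbers are hypothetical.

```python
def mixture_policy(pi_det_action, num_actions, epsilon):
    """mu(.|x) = epsilon * uniform + (1 - epsilon) * one-hot(pi_det(x))."""
    probs = [epsilon / num_actions] * num_actions
    probs[pi_det_action] += 1.0 - epsilon
    return probs


def q_update(q_y, action, target, alpha):
    """Tabular Q-learning style step: only the entry Q(y, action) moves."""
    new_q = list(q_y)
    new_q[action] += alpha * (target - q_y[action])
    return new_q


def va_update(v_y, a_y, action, target, alpha):
    """ASSUMED VA-style step (illustrative only): V(y) is shared by all actions."""
    new_v = v_y + alpha * (target - v_y)
    new_a = list(a_y)
    new_a[action] += alpha * (target - v_y - a_y[action])
    return new_v, new_a


# Behavior policy at one state, with pi_det choosing action 0 and epsilon = 0.8.
mu_y = mixture_policy(pi_det_action=0, num_actions=2, epsilon=0.8)

# One back-up target for (y, a), with a = action 0; everything starts at zero.
target, alpha = 1.0, 0.5
q_after = q_update([0.0, 0.0], 0, target, alpha)

v_after, a_after = va_update(0.0, [0.0, 0.0], 0, target, alpha)
va_q_after = [v_after + adv for adv in a_after]  # Q = V + A for each action
```

In this toy setting the Q-learning step leaves the estimate for the untaken action $b$ at its initial value, while under the assumed VA-style step the implied estimate $V(y)+A(y,b)$ also moves, because $V(y)$ is shared across actions.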

Theorems & Definitions (9)

Theorem 1
proof
Theorem 1
proof
Theorem 2
Lemma 2
proof
Lemma 2
proof
