Increasing the Action Gap: New Operators for Reinforcement Learning

Marc G. Bellemare, Georg Ostrovski, Arthur Guez, Philip S. Thomas, Rémi Munos

TL;DR

The paper introduces the consistent Bellman operator, which enforces local policy consistency, increases the action gap, and preserves optimality, with extensions to aggregation and Q-value interpolation for continuous spaces. It then broadens this to a family of optimality-preserving, gap-increasing operators applicable to general function approximators, including Baird's Advantage Learning and Persistent Advantage Learning (PAL). Theoretical results guarantee optimality preservation under weak conditions, while experiments on a bicycle domain and on Atari 2600 games show substantial performance gains and lower value estimates. The work suggests that rethinking core value-iteration updates can yield significant practical benefits in high-frequency or large-scale RL settings.

Abstract

This paper introduces new optimality-preserving operators on Q-functions. We first describe an operator for tabular representations, the consistent Bellman operator, which incorporates a notion of local policy consistency. We show that this local consistency leads to an increase in the action gap at each state; increasing this gap, we argue, mitigates the undesirable effects of approximation and estimation errors on the induced greedy policies. This operator can also be applied to discretized continuous space and time problems, and we provide empirical results evidencing superior performance in this context. Extending the idea of a locally consistent operator, we then derive sufficient conditions for an operator to preserve optimality, leading to a family of operators which includes our consistent Bellman operator. As corollaries we provide a proof of optimality for Baird's advantage learning algorithm and derive other gap-increasing operators with interesting properties. We conclude with an empirical study on 60 Atari 2600 games illustrating the strong potential of these new operators.
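
The local-consistency idea can be illustrated on a small tabular problem. The sketch below is not the authors' code: it assumes a transition tensor `P[x, a, x']` and reward table `R[x, a]` (both hypothetical names), and writes the consistent Bellman operator as the standard backup with one change, namely that on a self-transition ($x' = x$) it backs up $Q(x,a)$ instead of $\max_b Q(x,b)$. The random MDP is made up purely for illustration.

```python
import numpy as np

def bellman(Q, P, R, gamma):
    """Standard backup: R(x,a) + gamma * E_{x'}[max_b Q(x',b)]."""
    V = Q.max(axis=1)                      # V(x') = max_b Q(x', b)
    return R + gamma * (P @ V)             # P has shape (X, A, X)

def consistent_bellman(Q, P, R, gamma):
    """Same backup, except that a self-transition x' == x uses Q(x,a)
    in place of max_b Q(x,b) (the agent is assumed to repeat action a)."""
    V = Q.max(axis=1)
    states = np.arange(Q.shape[0])
    p_self = P[states, :, states]          # shape (X, A): P(x | x, a)
    return R + gamma * (P @ V - p_self * (V[:, None] - Q))

def fixed_point(operator, P, R, gamma, n_iters=500):
    """Apply the operator repeatedly, starting from Q = 0."""
    Q = np.zeros_like(R)
    for _ in range(n_iters):
        Q = operator(Q, P, R, gamma)
    return Q

# Illustrative random MDP (an assumption, not a domain from the paper).
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(size=(n_states, n_actions))

Q_star = fixed_point(bellman, P, R, gamma)
Q_cons = fixed_point(consistent_bellman, P, R, gamma)

# Action gap per (state, action): V(x) - Q(x, a); zero for greedy actions.
gap_star = Q_star.max(axis=1, keepdims=True) - Q_star
gap_cons = Q_cons.max(axis=1, keepdims=True) - Q_cons
```

In this setup both iterations should agree on the state values $\max_a Q(x,a)$, while `gap_cons` is never smaller than `gap_star`: the consistent operator only lowers the values of non-greedy actions.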

Paper Structure

This paper contains 19 sections, 7 theorems, 68 equations, 7 figures, 1 table.

Key Result

Theorem 1

Let $\mathcal{T}$ be the Bellman operator defined by (eqn:bellman_operator). Let $\mathcal{T}'$ be an operator with the property that there exists an $\alpha \in [0, 1)$ such that for all $Q \in \mathcal{Q}$, $x \in \mathcal{X}$, $a \in \mathcal{A}$, and letting $V(x) := \max\nolimits_b Q(x,b)$, (1) $\mathcal{T}' Q(x,a) \le \mathcal{T} Q(x,a)$, and (2) $\mathcal{T}' Q(x,a) \ge \mathcal{T} Q(x,a) - \alpha \left[ V(x) - Q(x,a) \right]$. Then $\mathcal{T}'$ is both optimality-preserving and gap-increasing.
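
As a rough numerical illustration of the theorem, the sketch below checks its two sufficient conditions, $\mathcal{T}'Q \le \mathcal{T}Q$ and $\mathcal{T}'Q \ge \mathcal{T}Q - \alpha[V - Q]$, for tabular versions of advantage learning and persistent advantage learning. The array layout (`P[x, a, x']`, `R[x, a]`), the function names, and the random test problem are all assumptions made for this example. A check on one random $Q$ is a sanity test, not a proof.

```python
import numpy as np

def bellman(Q, P, R, gamma):
    """T Q(x,a) = R(x,a) + gamma * E_{x'}[max_b Q(x',b)]."""
    return R + gamma * (P @ Q.max(axis=1))

def advantage_learning(Q, P, R, gamma, alpha):
    """T_AL Q(x,a) = T Q(x,a) - alpha * [V(x) - Q(x,a)]."""
    V = Q.max(axis=1, keepdims=True)
    return bellman(Q, P, R, gamma) - alpha * (V - Q)

def persistent_advantage_learning(Q, P, R, gamma, alpha):
    """T_PAL Q(x,a) = max(T_AL Q(x,a), R(x,a) + gamma * E_{x'}[Q(x',a)])."""
    # Backup that assumes the same action a is repeated in the next state.
    persist = R + gamma * np.einsum("xay,ya->xa", P, Q)
    return np.maximum(advantage_learning(Q, P, R, gamma, alpha), persist)

def satisfies_theorem_conditions(new_Q, Q, P, R, gamma, alpha, tol=1e-9):
    """Check T'Q <= TQ and T'Q >= TQ - alpha * (V - Q) elementwise."""
    TQ = bellman(Q, P, R, gamma)
    V = Q.max(axis=1, keepdims=True)
    upper_ok = new_Q <= TQ + tol
    lower_ok = new_Q >= TQ - alpha * (V - Q) - tol
    return bool(np.all(upper_ok & lower_ok))

# Illustrative random problem (an assumption, not data from the paper).
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(4), size=(4, 2))   # shape (X=4, A=2, X=4)
R = rng.uniform(size=(4, 2))
Q = rng.normal(size=(4, 2))
gamma, alpha = 0.9, 0.5

al_ok = satisfies_theorem_conditions(
    advantage_learning(Q, P, R, gamma, alpha), Q, P, R, gamma, alpha)
pal_ok = satisfies_theorem_conditions(
    persistent_advantage_learning(Q, P, R, gamma, alpha), Q, P, R, gamma, alpha)
```

Advantage learning sits exactly on the lower bound, and the persistent variant lies between the two bounds because $\mathbf{E}\, Q(x',a) \le \mathbf{E} \max_b Q(x',b)$.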

Figures (7)

  • Figure 1: A two-state MDP illustrating the non-stationary aspect of the Bellman operator. Here, $p$ and $r$ indicate transition probabilities and rewards, respectively. In state $x_1$ the agent may either eat cake to receive a reward of 1 and transition to $x_2$ with probability $\tfrac{1}{2}$, or abstain for no reward. State $x_2$ is a low-value absorbing state with $\epsilon > 0$.
  • Figure 2: Multilinear interpolation in two dimensions. The value at $x$ is approximated as $V(x) := \mathbf{E}_{z' \sim A(\cdot \mid x)} V(z')$. Here $A(z_1 \mid x) = (1 - \alpha)(1 - \beta)$, $A(z_2 \mid x) = \alpha (1 - \beta)$, etc.
  • Figure 3: Top. Falling and goal-reaching frequency for greedy policies derived from value iteration. Bottom. Sample bicycle trajectories after $100, 200, \dots, 1000$ iterations. In this coarse-resolution regime, the Bellman operator initially yields policies which circle the goal forever, while the consistent operator quickly yields successful trajectories.
  • Figure 4: Learning curves for two Atari 2600 games in the Original DQN setting.
  • Figure 5: Action gaps (left) and value functions (right) for a single episode of Space Invaders (Original DQN setting). Our operators yield markedly increased action gaps and lower values.
  • ...and 2 more figures
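
The interpolation in Figure 2 can be sketched in a few lines. The weights for $z_1$ and $z_2$ follow the caption; the caption elides the other two corners, so their weights and ordering below are the usual bilinear choice and should be read as an assumption. The function names are hypothetical. Here $\alpha$ and $\beta$ are fractional offsets of $x$ inside its grid cell, unrelated to the $\alpha$ of Theorem 1.

```python
import numpy as np

def bilinear_weights(alpha, beta):
    """Weights A(z | x) over the four corners of the grid cell containing x.

    alpha, beta in [0, 1] are the fractional offsets of x along each axis.
    """
    return np.array([
        (1 - alpha) * (1 - beta),  # z_1, as in the caption
        alpha * (1 - beta),        # z_2, as in the caption
        (1 - alpha) * beta,        # assumed: remaining corner (ordering not given)
        alpha * beta,              # assumed: remaining corner (ordering not given)
    ])

def interpolate_value(corner_values, alpha, beta):
    """V(x) = sum_z A(z | x) * V(z), i.e. an expectation under A(. | x)."""
    weights = bilinear_weights(alpha, beta)
    return float(weights @ np.asarray(corner_values, dtype=float))

# Example: corner values chosen arbitrarily for illustration.
corner_values = [0.0, 1.0, 2.0, 3.0]
v_center = interpolate_value(corner_values, 0.5, 0.5)
```

Because the weights are non-negative and sum to one, $A(\cdot \mid x)$ is a probability distribution over grid points, which is what lets the interpolated value be written as an expectation.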

Theorems & Definitions (13)

  • Definition 1
  • Definition 2
  • Theorem 1
  • Remark 1
  • Corollary 1
  • Corollary 2
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • ...and 3 more