Bridging RL Theory and Practice with the Effective Horizon

Cassidy Laidlaw, Stuart Russell, Anca Dragan

TL;DR

The paper introduces the effective horizon, a complexity measure for MDPs that roughly captures how many steps of lookahead search are needed to identify the next optimal action when leaf nodes are evaluated with random rollouts. It pairs this theory with BRIDGE, a dataset of 155 deterministic MDPs with full tabular representations, which makes it possible to compute instance-dependent bounds exactly. It also formalizes the Greedy Over Random Policy (GORP) algorithm and proves an effective-horizon-based sample complexity bound of $N \le T^2 A^{H}$, where $T$ is the horizon, $A$ is the number of actions, and $H$ is the effective horizon. Empirically, bounds based on the effective horizon track PPO and DQN performance more closely than prior bounds, and they predict the effects of reward shaping and pre-trained exploration policies. The work also shows that in a surprising fraction of environments, acting greedily with respect to the random policy's Q-function is already optimal, which offers practical intuition and suggests new algorithmic directions. The analysis is limited to deterministic, discrete-action environments; extensions to stochastic settings and to generalization are left as future work.

Abstract

Deep reinforcement learning (RL) works impressively in some environments and fails catastrophically in others. Ideally, RL theory should be able to provide an understanding of why this is, i.e. bounds predictive of practical performance. Unfortunately, current theory does not quite have this ability. We compare standard deep RL algorithms to prior sample complexity bounds by introducing a new dataset, BRIDGE. It consists of 155 deterministic MDPs from common deep RL benchmarks, along with their corresponding tabular representations, which enables us to exactly compute instance-dependent bounds. We choose to focus on deterministic environments because they share many interesting properties of stochastic environments, but are easier to analyze. Using BRIDGE, we find that prior bounds do not correlate well with when deep RL succeeds vs. fails, but discover a surprising property that does. When actions with the highest Q-values under the random policy also have the highest Q-values under the optimal policy (i.e. when it is optimal to be greedy on the random policy's Q function), deep RL tends to succeed; when they don't, deep RL tends to fail. We generalize this property into a new complexity measure of an MDP that we call the effective horizon, which roughly corresponds to how many steps of lookahead search would be needed in that MDP in order to identify the next optimal action, when leaf nodes are evaluated with random rollouts. Using BRIDGE, we show that the effective horizon-based bounds are more closely reflective of the empirical performance of PPO and DQN than prior sample complexity bounds across four metrics. We also find that, unlike existing bounds, the effective horizon can predict the effects of using reward shaping or a pre-trained exploration policy. Our code and data are available at https://github.com/cassidylaidlaw/effective-horizon
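
The property described in the abstract (being greedy on the random policy's Q-function is optimal) can be checked directly on a small tabular MDP. The following is a minimal sketch, not the authors' code: it assumes a time-homogeneous deterministic MDP given as hypothetical arrays `next_state[s, a]` and `reward[s, a]`, a finite horizon, and it checks every state and timestep rather than only reachable ones.

```python
import numpy as np

def q_values(next_state, reward, horizon, backup):
    """Finite-horizon Q-values by backward induction.

    `backup` maps next-step Q-values (shape [S, A]) to state values (shape [S]):
    a mean over actions gives the uniform random policy, a max gives the optimal policy.
    """
    n_states, n_actions = reward.shape
    q = np.zeros((horizon, n_states, n_actions))
    for t in reversed(range(horizon)):
        # No future value after the final timestep.
        future = np.zeros(n_states) if t == horizon - 1 else backup(q[t + 1])
        q[t] = reward + future[next_state]
    return q

def greedy_on_random_is_optimal(next_state, reward, horizon, tol=1e-9):
    """True if every action that is greedy for Q^rand is also greedy for Q*."""
    q_rand = q_values(next_state, reward, horizon, lambda q: q.mean(axis=1))
    q_opt = q_values(next_state, reward, horizon, lambda q: q.max(axis=1))
    rand_greedy = q_rand >= q_rand.max(axis=2, keepdims=True) - tol
    opt_greedy = q_opt >= q_opt.max(axis=2, keepdims=True) - tol
    return bool(np.all(opt_greedy[rand_greedy]))

# Hypothetical chain: action 1 moves right, reward 1 for entering state 2.
chain_next = np.array([[0, 1], [1, 2], [2, 2]])
chain_reward = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 0.0]])

# Hypothetical trap: a small immediate reward looks better under random rollouts
# than the action that leads to the larger delayed reward.
trap_next = np.array([[2, 1], [2, 2], [2, 2]])
trap_reward = np.array([[0.6, 0.0], [1.0, 0.0], [0.0, 0.0]])

print(greedy_on_random_is_optimal(chain_next, chain_reward, horizon=3))
print(greedy_on_random_is_optimal(trap_next, trap_reward, horizon=2))
```

In the chain example the random policy's Q-values already rank the optimal action first, while in the trap example they favor the immediate reward of 0.6 over a path worth 1, so the check fails there.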

Paper Structure (47 sections, 15 theorems, 111 equations, 11 figures, 7 tables, 3 algorithms)

Key Result

Theorem 2.1

There is an RL algorithm which can solve any deterministic MDP with sample complexity $N \leq T \lceil A^T / 2 \rceil$. Conversely, for any RL algorithm and any values of $T$ and $A$, there must be some deterministic MDP for which its sample complexity $N \geq T (\lceil A^T / 2 \rceil - 1)$.
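
To get a feel for how quickly the worst-case bounds in Theorem 2.1 grow, they can be evaluated numerically. This is a small illustrative sketch; the function name `worst_case_bounds` is hypothetical, and the formulas are copied from the theorem statement above.

```python
def worst_case_bounds(horizon: int, num_actions: int) -> tuple[int, int]:
    """Return (upper, lower) sample complexity bounds from Theorem 2.1.

    upper = T * ceil(A^T / 2)
    lower = T * (ceil(A^T / 2) - 1)
    """
    # Integer ceiling of A^T / 2, avoiding floating point for large exponents.
    half = (num_actions ** horizon + 1) // 2
    return horizon * half, horizon * (half - 1)

for horizon, num_actions in [(2, 2), (10, 3), (100, 4)]:
    upper, lower = worst_case_bounds(horizon, num_actions)
    print(f"T={horizon}, A={num_actions}: {lower:.3e} <= worst-case N <= {upper:.3e}")
```

The two bounds differ only by $T$, so the worst case is essentially pinned at exponential in the horizon, which is why instance-dependent measures are needed to explain practical performance.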

Figures (11)

  • Figure 1: We introduce the effective horizon, a property of MDPs that controls how difficult RL is. Our analysis is motivated by Greedy Over Random Policy (GORP), a simple Monte Carlo planning algorithm (left) that exhaustively explores action sequences of length $k$ and then uses $m$ random rollouts to evaluate each leaf node. The effective horizon combines both $k$ and $m$ into a single measure. We prove sample complexity bounds based on the effective horizon that correlate closely with the real performance of PPO, a deep RL algorithm, on our BRIDGE dataset of 155 deterministic MDPs (right).
  • Figure 2: Examples of calculating the effective horizon $H$ using Theorem \ref{thm:deterministicefhorizon}; see Section \ref{sec:efhorizon_examples} for the details.
  • Figure 3: Learning curves for PPO, DQN, and GORP on full-horizon Atari games. We use 5 random seeds for all algorithms. The solid line shows the median return throughout training while the shaded region shows the range of returns over random seeds.
  • Figure 4: Empty-5x5, one of the Minigrid MDPs from BRIDGE and an example of a goal MDP (Definition \ref{definition:goal_mdp}). The agent (red triangle) can turn left, turn right, or go forward, and its goal is to reach the green square, which gives a reward of 1.
  • Figure 5: Our BRIDGE dataset consists of 155 deterministic MDPs with full tabular representations. We include MDPs from three popular RL benchmarks which cover a range of state space sizes, action space sizes, and horizons.
  • ...and 6 more figures
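
The GORP procedure described in the Figure 1 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' implementation: it assumes a deterministic tabular MDP given as hypothetical arrays `next_state[s, a]` and `reward[s, a]`, a fixed start state 0, and that the agent commits to one action per timestep after scoring every length-$k$ action sequence with $m$ random rollouts.

```python
import itertools
import numpy as np

def rollout_return(next_state, reward, horizon, prefix, rng):
    """Return of one episode from state 0: follow `prefix`, then act uniformly at random."""
    n_actions = reward.shape[1]
    state, total = 0, 0.0
    for t in range(horizon):
        action = prefix[t] if t < len(prefix) else int(rng.integers(n_actions))
        total += reward[state, action]
        state = next_state[state, action]
    return total

def gorp(next_state, reward, horizon, k, m, seed=0):
    """Greedily pick one action per timestep using k-step exhaustive lookahead."""
    rng = np.random.default_rng(seed)
    n_actions = reward.shape[1]
    chosen = []  # actions committed to so far
    for t in range(horizon):
        depth = min(k, horizon - t)  # do not look past the end of the episode
        best_action, best_value = 0, -np.inf
        for seq in itertools.product(range(n_actions), repeat=depth):
            prefix = chosen + list(seq)
            # Evaluate the leaf reached by `prefix` with m random rollouts.
            value = np.mean([
                rollout_return(next_state, reward, horizon, prefix, rng)
                for _ in range(m)
            ])
            if value > best_value:
                best_action, best_value = seq[0], value
        chosen.append(best_action)
    return chosen

# Hypothetical two-step MDP where a small immediate reward (0.6) competes
# with a larger delayed reward (1.0).
trap_next = np.array([[2, 1], [2, 2], [2, 2]])
trap_reward = np.array([[0.6, 0.0], [1.0, 0.0], [0.0, 0.0]])

print(gorp(trap_next, trap_reward, horizon=2, k=2, m=1))
```

With $k = 2$ the lookahead covers the whole episode in this example, so no random rollouts are needed and the delayed reward is found. With $k = 1$ the second step is evaluated by random rollouts, whose average (about 0.5) tends to lose to the immediate 0.6; this is the kind of trade-off between $k$ and $m$ that the effective horizon summarizes.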

Theorems & Definitions (33)

  • Theorem 2.1
  • Definition 5.1: $k$-QVI-solvable
  • Definition 5.1: Effective horizon
  • Lemma 5.1
  • Theorem 5.2
  • Theorem A.1
  • proof
  • Lemma A.1
  • proof
  • Lemma A.1
  • ...and 23 more