The Effective Horizon Explains Deep RL Performance in Stochastic Environments

Cassidy Laidlaw; Banghua Zhu; Stuart Russell; Anca Dragan

TL;DR

The paper asks why deep RL with random exploration and neural function approximators often succeeds in stochastic environments, where theory traditionally predicts poor performance. It introduces the stochastic effective horizon and a model-free algorithm, SQIRL, which alternates random data collection with a few regression-based steps of fitted Q-iteration that approximate a few steps of value iteration. By formalizing a regression oracle and analyzing $k$-QVI-solvable MDPs, the authors derive instance-dependent sample complexity bounds that are exponential only in the effective horizon $\bar{H}$ and in the complexity of the function class, rather than in the full horizon. Empirically, SQIRL performs comparably to PPO and DQN on sticky-action Bridge MDPs and full-length Atari games, supporting the claim that the effective horizon and regression-based learning can predict practical deep RL performance. Overall, the work provides a principled explanation for deep RL success and suggests that separating exploration from learning is a fruitful analytical lens.
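To make the idea of solving an MDP with a few steps of value iteration concrete, the following is a minimal planning sketch, not the authors' code. It assumes a known tabular, finite-horizon MDP (the paper's algorithms are model-free), uses the convention that $k = 1$ means acting greedily on the random policy's Q-function with no extra backups, and runs on a hypothetical 4-state toy MDP. All function names are illustrative.

```python
import numpy as np

def toy_mdp():
    """Hypothetical 4-state, 2-action MDP; state 3 is absorbing with zero reward."""
    P = np.zeros((4, 2, 4))             # P[s, a, s'] transition probabilities
    P[0, 0, 1], P[0, 0, 3] = 0.9, 0.1   # risky action: usually reaches state 1
    P[0, 1, 2] = 1.0                    # safe action: always reaches state 2
    P[1:, :, 3] = 1.0                   # states 1, 2, 3 all move to state 3
    R = np.zeros((4, 2))                # R[s, a] rewards
    R[1] = [1.0, -1.0]                  # state 1: good only if action 0 is taken
    R[2] = [0.3, 0.3]                   # state 2: small reward either way
    return P, R

def random_policy_q(P, R, T):
    """Q-values of the uniformly random policy, one (S, A) array per timestep."""
    q, next_v = [None] * T, np.zeros(R.shape[0])
    for t in reversed(range(T)):
        q[t] = R + P @ next_v           # reward plus expected next-state value
        next_v = q[t].mean(axis=1)      # random policy averages over actions
    return q

def bellman_backup(P, R, q):
    """One step of Q-value iteration applied at every timestep."""
    T, S = len(q), R.shape[0]
    return [R + P @ (q[t + 1].max(axis=1) if t + 1 < T else np.zeros(S))
            for t in range(T)]

def greedy_return(P, R, T, k, start):
    """Exact return of the policy that is greedy after k - 1 backups."""
    q = random_policy_q(P, R, T)
    for _ in range(k - 1):
        q = bellman_backup(P, R, q)
    idx, v = np.arange(R.shape[0]), np.zeros(R.shape[0])
    for t in reversed(range(T)):
        a = q[t].argmax(axis=1)         # greedy action in every state
        v = R[idx, a] + P[idx, a] @ v   # evaluate that policy exactly
    return v[start]

P, R = toy_mdp()
for k in (1, 2):
    print(k, greedy_return(P, R, T=2, k=k, start=0))
```

In this toy example the random policy undervalues the risky branch because it picks the bad action in state 1 half the time, so one additional backup is needed before the greedy policy becomes optimal.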

Abstract

Reinforcement learning (RL) theory has largely focused on proving minimax sample complexity bounds. These require strategic exploration algorithms that use relatively limited function classes for representing the policy or value function. Our goal is to explain why deep RL algorithms often perform well in practice, despite using random exploration and much more expressive function classes like neural networks. Our work arrives at an explanation by showing that many stochastic MDPs can be solved by performing only a few steps of value iteration on the random policy's Q function and then acting greedily. When this is true, we find that it is possible to separate the exploration and learning components of RL, making it much easier to analyze. We introduce a new RL algorithm, SQIRL, that iteratively learns a near-optimal policy by exploring randomly to collect rollouts and then performing a limited number of steps of fitted-Q iteration over those rollouts. Any regression algorithm that satisfies basic in-distribution generalization properties can be used in SQIRL to efficiently solve common MDPs. This can explain why deep RL works, since it is empirically established that neural networks generalize well in-distribution. Furthermore, SQIRL explains why random exploration works well in practice. We leverage SQIRL to derive instance-dependent sample complexity bounds for RL that are exponential only in an "effective horizon" of lookahead and on the complexity of the class used for function approximation. Empirically, we also find that SQIRL performance strongly correlates with PPO and DQN performance in a variety of stochastic environments, supporting that our theoretical analysis is predictive of practical performance. Our code and data are available at https://github.com/cassidylaidlaw/effective-horizon.
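The loop described in the abstract (explore randomly, regress the random policy's Q-values, apply a limited number of fitted-Q steps, act greedily) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the regression oracle is replaced by a per-(state, action) average, the environment is a hypothetical tabular toy MDP that is sampled from rather than planned over, and every name (`fit_table`, `rollouts`, `sqirl_sketch`, `toy_mdp`) is made up for this example.

```python
import numpy as np

def toy_mdp():
    """Hypothetical 4-state, 2-action MDP; state 3 is absorbing with zero reward."""
    P = np.zeros((4, 2, 4))
    P[0, 0, 1], P[0, 0, 3] = 0.9, 0.1
    P[0, 1, 2] = 1.0
    P[1:, :, 3] = 1.0
    R = np.zeros((4, 2))
    R[1], R[2] = [1.0, -1.0], [0.3, 0.3]
    return P, R

def fit_table(S, A, s, a, y):
    """Stand-in regression oracle: mean target per (state, action); 0 if unseen."""
    total, count = np.zeros((S, A)), np.zeros((S, A))
    np.add.at(total, (s, a), y)
    np.add.at(count, (s, a), 1.0)
    return total / np.maximum(count, 1.0)

def rollouts(P, R, T, policy, n, start, rng):
    """n episodes: follow the policies fixed so far, then act uniformly at random."""
    S, A = R.shape
    states = np.zeros((T, n), dtype=int)
    actions = np.zeros((T, n), dtype=int)
    rewards = np.zeros((T, n))
    s = np.full(n, start)
    for t in range(T):
        a = policy[t][s] if t < len(policy) else rng.integers(A, size=n)
        states[t], actions[t], rewards[t] = s, a, R[s, a]
        cdf = P[s, a].cumsum(axis=1)             # inverse-CDF sampling of s'
        u = rng.random(n)[:, None]
        s = np.minimum((u >= cdf).sum(axis=1), S - 1)
    return states, actions, rewards

def sqirl_sketch(P, R, T, k, n, start, rng):
    """Fix one greedy per-timestep policy at a time from random-exploration data."""
    S, A = R.shape
    policy = []
    for i in range(T):
        s, a, r = rollouts(P, R, T, policy, n, start, rng)
        leaf = min(i + k - 1, T - 1)
        # Regress the random policy's Q-values onto observed returns-to-go.
        q = fit_table(S, A, s[leaf], a[leaf], r[leaf:].sum(axis=0))
        # Back up towards timestep i with fitted Q-iteration.
        for t in range(leaf - 1, i - 1, -1):
            y = r[t] + q[s[t + 1]].max(axis=1)
            q = fit_table(S, A, s[t], a[t], y)
        policy.append(q.argmax(axis=1))          # act greedily at timestep i
    return policy

P, R = toy_mdp()
policy = sqirl_sketch(P, R, T=2, k=2, n=2000, start=0,
                      rng=np.random.default_rng(0))
print(policy[0][0])  # action chosen in the start state
```

With enough rollouts, the `k=2` run should prefer the risky action 0 in the start state of this toy MDP, while `k=1` should prefer the safe action 1. The tabular average could be swapped for any regressor that generalizes in-distribution, which is the property the abstract relies on.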

Paper Structure

This paper contains 19 sections, 9 theorems, 50 equations, 8 figures, 6 tables, and 2 algorithms.

Introduction
Setup and Related Work
Related work
The Stochastic Effective Horizon and SQIRL
SQIRL
Experiments
Full-length Atari games
Conclusion
Least-squares regression oracles
VC-type hypothesis classes
Proofs
Proof of Lemma \ref{lemma:efhorizonrelationship}
Proof of Theorem \ref{theorem:stochastic_gorp_sample_complexity}
Proof of Theorem \ref{theorem:vc_type_erm}
Experiment details
...and 4 more sections

Key Result

Lemma 3.3

The deterministic effective horizon $H$ is bounded as [displayed bound missing in this extract]. Furthermore, if an MDP is $k$-QVI-solvable, then with probability at least $1 - \delta$, GORP will return an optimal policy with sample complexity at most $O\left( k T^2 A^{\bar{H}_k} \log(T A / \delta) \right)$.
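To get a feel for how this bound scales, the expression inside the $O(\cdot)$ can be evaluated directly. The sketch below reads $T$ as the horizon and $A$ as the number of actions, and sets the unknown constant hidden by the $O(\cdot)$ to 1 by default, so the outputs are only meaningful as relative comparisons. The function name `gorp_bound` and the example values are hypothetical.

```python
import math

def gorp_bound(k, T, A, H_k, delta, c=1.0):
    """Evaluate c * k * T^2 * A^H_k * log(T * A / delta).

    c stands in for the constant hidden by the O(.) notation (assumed, not
    given in the source), so only ratios between calls are informative.
    """
    return c * k * T**2 * A**H_k * math.log(T * A / delta)

# Illustrative values only: a long horizon with a small effective horizon.
T, A, delta = 100, 4, 0.05
for H_k in (2, 4, 8):
    print(H_k, f"{gorp_bound(k=2, T=T, A=A, H_k=H_k, delta=delta):.3e}")
```

Each unit increase in $\bar{H}_k$ multiplies the expression by $A$, whereas $T$ enters only through the $T^2$ factor and the logarithm.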

Figures (8)

Figure 1: We introduce the shallow Q-iteration via reinforcement learning (SQIRL) algorithm, which uses random exploration and function approximation to efficiently solve environments with a low stochastic effective horizon. SQIRL is a generalization of the GORP algorithm (Laidlaw et al., 2023) to stochastic environments. In the figure, both algorithms are shown solving the first timestep of a 2-QVI-solvable MDP. The GORP algorithm (left) uses random rollouts to estimate the random policy's Q-values at the leaf nodes of a "search tree" and then backs up these values to the root node. It is challenging to generalize this algorithm to stochastic environments because both the initial state and transition dynamics are random. This makes it impossible to perform the steps of GORP where it averages over random rollouts and backs up values along deterministic transitions. SQIRL replaces these steps with regression of the random policy's Q-values at leaf nodes and fitted Q-iteration (FQI) for backing up values, allowing it to efficiently learn in stochastic environments.
Figure 2: Among sticky-action versions of the MDPs in the Bridge dataset, more than half can be approximately solved by acting greedily with respect to the random policy's Q-function ($k = 1$); many more can be solved by applying just a few steps of Q-value iteration before acting greedily ($2 \leq k \leq 5$). When $k$ is low, we observe that deep RL algorithms like PPO are much more likely to solve the environment.
Figure 3: The empirical sample complexity of SQIRL correlates closely with that of PPO and DQN, suggesting that our theoretical analysis of SQIRL is a powerful tool for understanding when and why deep RL works in stochastic environments.
Figure 4: The performance of SQIRL in standard full-length Atari environments is comparable to PPO and DQN. This suggests that PPO and DQN succeed in standard benchmarks for similar reasons that SQIRL succeeds. Thus, our theoretical analysis of SQIRL based on the effective horizon can help explain deep RL performance in these environments.
Figure 5: Learning curves for PPO, DQN, and SQIRL on the sticky-action Bridge MDPs. Solid lines show the median return (over 5 random seeds) of the policies learned by each algorithm throughout training. The shaded region shows the range of returns over random seeds. The optimal return in each environment is shown as the dashed black line.
...and 3 more figures

Theorems & Definitions (23)

Definition 3.1: $k$-QVI-solvable
Definition 3.2: $k$-gap
Definition 3.3: Stochastic effective horizon
Lemma 3.3
Theorem 3.4: sample complexity
Definition A.1
Definition A.2
Definition A.3
Theorem A.3
Example A.4
...and 13 more
