Averaging $n$-step Returns Reduces Variance in Reinforcement Learning

Brett Daley; Martha White; Marlos C. Machado

TL;DR

The paper identifies variance as a key bottleneck in long-horizon multistep RL targets and proves that averaging multiple $n$-step returns into a compound return reduces variance when the compound return's contraction modulus matches that of the $n$-step return. It provides a general variance model for compound returns, establishes a variance-reduction theorem, and gives a finite-time convergence bound under linear function approximation. To translate theory into practice, the authors introduce PiLaR (piecewise $\lambda$-return), a computationally light two-bootstrap approximation that preserves variance reduction and matches the effective behavior of $\lambda$-returns. Experiments with DQN and PPO show improved sample efficiency in several tasks, highlighting the practical impact of variance-aware multistep targets for both off-policy and on-policy deep RL.

Abstract

Multistep returns, such as $n$-step returns and $\lambda$-returns, are commonly used to improve the sample efficiency of reinforcement learning (RL) methods. The variance of the multistep returns becomes the limiting factor in their length; looking too far into the future increases variance and reverses the benefits of multistep learning. In our work, we demonstrate the ability of compound returns -- weighted averages of $n$-step returns -- to reduce variance. We prove for the first time that any compound return with the same contraction modulus as a given $n$-step return has strictly lower variance. We additionally prove that this variance-reduction property improves the finite-sample complexity of temporal-difference learning under linear function approximation. Because general compound returns can be expensive to implement, we introduce two-bootstrap returns which reduce variance while remaining efficient, even when using minibatched experience replay. We conduct experiments showing that compound returns often increase the sample efficiency of $n$-step deep RL agents like DQN and PPO.
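To make the abstract's terminology concrete, the following is a minimal sketch of an $n$-step return, a compound return (a weighted average of $n$-step returns), and the contraction modulus of a weight vector. It is not the authors' implementation. The function names, the array layout (`rewards[i]` is the reward on the transition out of state $i$, `values[i]` is the value estimate of state $i$), and the use of $\sum_n w_n \gamma^n$ as the contraction modulus are illustrative assumptions based on the standard definitions.

```python
import numpy as np

def n_step_return(rewards, values, t, n, gamma):
    """n-step return from time t: n discounted rewards plus a bootstrap.

    Assumed layout (hypothetical): len(values) == len(rewards) + 1.
    """
    discounts = gamma ** np.arange(n)
    reward_sum = np.dot(discounts, np.asarray(rewards[t:t + n], dtype=float))
    return float(reward_sum + gamma ** n * values[t + n])

def compound_return(rewards, values, t, weights, gamma):
    """Weighted average of n-step returns; weights[k] applies to n = k + 1."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be nonnegative and sum to 1")
    return float(sum(
        w * n_step_return(rewards, values, t, k + 1, gamma)
        for k, w in enumerate(weights) if w > 0
    ))

def contraction_modulus(weights, gamma):
    """Weighted average of gamma**n, with weights[k] applying to n = k + 1."""
    weights = np.asarray(weights, dtype=float)
    return float(np.dot(weights, gamma ** np.arange(1, len(weights) + 1)))

# Toy trajectory: three unit rewards, bootstrap value 10 at the final state.
rewards = [1.0, 1.0, 1.0]
values = [0.0, 0.0, 0.0, 10.0]
gamma = 0.5

g1 = n_step_return(rewards, values, t=0, n=1, gamma=gamma)
g3 = n_step_return(rewards, values, t=0, n=3, gamma=gamma)
# Equal-weight average of the 1-step and 3-step returns.
g_avg = compound_return(rewards, values, 0, [0.5, 0.0, 0.5], gamma)
beta = contraction_modulus([0.5, 0.0, 0.5], gamma)
```

In this toy setup the averaged return lies between the 1-step and 3-step returns. The comparison the abstract refers to is between a compound return and the single $n$-step return whose $\gamma^n$ equals the compound return's modulus.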

Paper Structure (27 sections, 20 theorems, 77 equations, 7 figures, 3 tables, 1 algorithm)

Introduction
Background
Variance Analysis
Characterizing Compound-Return Variance
Error-Reduction Property and Effective $n$-step
The Variance-Reduction Property
Finite-Time Analysis
Case Study: $\lambda$-returns
Piecewise $\lambda$-Returns
Deep RL Experiments
PPO
Conclusion
Variance Assumptions
How Realistic Is the Proposed Variance Model?
Proofs
...and 12 more sections

Key Result

Proposition 3.0

The variance of an $n$-step return is

Figures (7)

Figure 1: Comparing $n$-step returns and $\lambda$-returns, paired by COM, in a random walk. Dashed lines indicate the lowest errors attained.
Figure 2: TD-error weights for PiLaR and a $\lambda$-return (${\lambda=0.904}$). Both returns have the same contraction modulus as a 10-step return when $\gamma=0.99$.
Figure 3: Learning curves for DQN with $n$-step returns and PiLaRs in five MinAtar games.
Figure 4: Learning curves for PPO with $n$-step returns and $\lambda$-returns in three MuJoCo environments.
Figure 5: Variances of the $n$-step returns originating from the initial state in three environments. The solid green line indicates the true variance while the dashed black lines indicate the lower and upper bounds predicted by our $n$-step variance model.
...and 2 more figures
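The value $\lambda=0.904$ in the Figure 2 caption can be checked with a short calculation. The sketch below assumes the standard contraction moduli, $\gamma^n$ for an $n$-step return and $\gamma(1-\lambda)/(1-\gamma\lambda)$ for a $\lambda$-return, and solves for the $\lambda$ that equates them. The function names are hypothetical and not taken from the paper.

```python
def nstep_modulus(n, gamma):
    """Contraction modulus of an n-step return."""
    return gamma ** n

def lambda_modulus(lam, gamma):
    """Contraction modulus of a lambda-return (standard closed form)."""
    return gamma * (1.0 - lam) / (1.0 - gamma * lam)

def matching_lambda(n, gamma):
    """Solve gamma * (1 - lam) / (1 - gamma * lam) = gamma**n for lam."""
    return (1.0 - gamma ** (n - 1)) / (1.0 - gamma ** n)

lam = matching_lambda(n=10, gamma=0.99)
print(round(lam, 3))  # 0.904
# Both moduli agree up to floating-point error.
gap = abs(lambda_modulus(lam, 0.99) - nstep_modulus(10, 0.99))
```

The closed form follows from rearranging $1-\lambda = \gamma^{n-1}(1-\gamma\lambda)$, which gives $\lambda = (1-\gamma^{n-1})/(1-\gamma^{n})$.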

Theorems & Definitions (39)

Proposition 3.0
proof
Lemma 3.0
Proposition 3.0
proof
Proposition 3.0: Effective $n$-step of compound return
proof
Theorem 3.1: Variance-reduction property of compound returns
proof
Corollary 3.1: Variance reduction of $\lambda$-return
...and 29 more
