Empirical Design in Reinforcement Learning

Andrew Patterson, Samuel Neumann, Martha White, Adam White

TL;DR

This manuscript is both a call to action and a comprehensive resource for how to do good experiments in reinforcement learning, covering the statistical assumptions underlying common performance measures, how to properly characterize performance variation and stability, and how to deal with hyper-parameters and experimenter bias.

Abstract

Empirical design in reinforcement learning is no small task. Running good experiments requires attention to detail and, at times, significant computational resources. While the compute resources available per dollar have continued to grow rapidly, so has the scale of typical experiments in reinforcement learning. It is now common to benchmark agents with millions of parameters against dozens of tasks, each using the equivalent of 30 days of experience. The scale of these experiments often conflicts with the need for proper statistical evidence, especially when comparing algorithms. Recent studies have highlighted how popular algorithms are sensitive to hyper-parameter settings and implementation details, and that common empirical practice leads to weak statistical evidence (Machado et al., 2018; Henderson et al., 2018). Here we take this one step further. This manuscript is both a call to action and a comprehensive resource for how to do good experiments in reinforcement learning. In particular, we cover: the statistical assumptions underlying common performance measures, how to properly characterize performance variation and stability, hypothesis testing, special considerations for comparing multiple agents, baseline and illustrative example construction, and how to deal with hyper-parameters and experimenter bias. Throughout, we highlight common mistakes found in the literature and their statistical consequences in example experiments. The objective of this document is to provide answers on how we can use our unprecedented compute to do good science in reinforcement learning, as well as stay alert to potential pitfalls in our empirical design.

Paper Structure (67 sections, 3 equations, 19 figures)

Figures (19)

  • Figure 1: Two-stage approach to comparing two algorithms. The goal of this experimental workflow is to progress left to right. Good choices in the colored boxes should limit the experiment rerunning sometimes forced by the yellow decision diamonds. The same basic process is used with automatic hyperparameter optimization algorithms (HOA), except that the HOA chooses which hyperparameters to test from a specified range (instead of a set) and early stopping is sometimes used (during the first red Run experiment box). You should still use multiple runs to evaluate the hyperparameters within the HOA, and after it is done you should check whether the best hyperparameters are at the edge of the ranges. From there, one would proceed with Run experiment (the second red box) and continue with the workflow. Each stage of the workflow is discussed in detail in the text; we link each for convenience. Section \ref{obs-exps} discusses how to decide on key experiment details (in \ref{sec_steps}), the basics of how to run an experiment (in \ref{sec_first}), how to plot learning curves and confidence intervals (in \ref{sec_ci}), and how to decide if we need more runs (in \ref{sec_moreruns}). Section \ref{sec_hypers} discusses how to construct hyperparameter sets and how to determine if you need to expand the hyperparameter set (in \ref{hyper_overall}). Section \ref{sec_twoalgs} discusses ways to compare two (or more) algorithms, including how to detect if the changes you have made to a baseline significantly improve performance.
  • Figure 2: A single run of an Expected Sarsa agent on a simple, tabular maze environment. The return rate for this agent is $M = 0.827$. The agent has near-optimal performance toward the end of the curve, as it reliably reaches the goal in 15 to 17 steps, with the return hovering between $0.99^{17} \approx 0.84$ and $0.99^{15} \approx 0.86$.
  • Figure 3: Understanding variability in agents. (a) 30 individual E-Sarsa agents on a simple, tabular maze environment. The thick black line shows the mean over individual agents over time. (b) Raw data from 10 rats running a water maze.
  • Figure 4: Tolerance intervals for the discounted return of DQN on Mountain Car, over 50 runs with $\alpha = 0.05$. Recall that $\beta$ specifies the percentage of the distribution considered for the tolerance interval. (a) Tolerance interval with $\beta = 0.9$, with mean performance. (b) Tolerance interval with $\beta = 0.7$, with mean performance; notice the latter is tighter. (c) Tolerance interval with $\beta = 0.7$, with median performance.
  • Figure 5: Confidence intervals around the discounted return of DQN on Mountain Car averaged over 30 runs. (a) Student's t-distribution confidence interval with $\alpha = 0.05$. (b) Student's t-distribution confidence interval with $\alpha = 0.3$. (c) Bootstrap confidence interval with $\alpha = 0.05$, with mean performance.
  • ...and 14 more figures
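
The bootstrap confidence interval mentioned in the caption of Figure 5(c) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a plain percentile bootstrap over one performance number per run, the function name `bootstrap_ci` and its parameters are hypothetical, and the data are synthetic stand-ins rather than results from the paper.

```python
import random
import statistics


def bootstrap_ci(samples, alpha=0.05, n_resamples=10_000, seed=0):
    """Percentile-bootstrap (1 - alpha) confidence interval for the mean.

    Hypothetical helper for illustration; assumes one number per run.
    """
    rng = random.Random(seed)
    n = len(samples)
    # Resample the runs with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    # Take the alpha/2 and 1 - alpha/2 empirical quantiles of those means.
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# Synthetic stand-in for 30 runs (NOT data from the paper).
data_rng = random.Random(1)
returns = [data_rng.gauss(0.8, 0.05) for _ in range(30)]

mean = statistics.fmean(returns)
lo, hi = bootstrap_ci(returns, alpha=0.05)
print(f"mean={mean:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

Calling `bootstrap_ci(returns, alpha=0.3)` on the same data gives a narrower interval, which mirrors the contrast between $\alpha = 0.05$ and $\alpha = 0.3$ in the Figure 5 captions.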