Position: Benchmarking is Limited in Reinforcement Learning Research

Scott M. Jordan, Adam White, Bruno Castro da Silva, Martha White, Philip S. Thomas

TL;DR

It is shown that conducting rigorous performance benchmarks likely has computational costs that are often prohibitive, and the use of an additional experimentation paradigm is advocated to overcome the limitations of benchmarking.

Abstract

Novel reinforcement learning algorithms, or improvements on existing ones, are commonly justified by evaluating their performance on benchmark environments and comparing them to an ever-changing set of standard algorithms. However, despite numerous calls for improvements, experimental practices continue to produce misleading or unsupported claims. One reason for the ongoing substandard practices is that conducting rigorous benchmarking experiments requires substantial computational time. This work investigates the sources of increased computational costs in rigorous experiment designs. We show that conducting rigorous performance benchmarks will likely have computational costs that are often prohibitive. As a result, we argue for using an additional experimentation paradigm to overcome the limitations of benchmarking.

Paper Structure

This paper contains 21 sections, 5 equations, 14 figures, and 1 table.

Figures (14)

  • Figure 1: This plot shows the average width, over all algorithms, of the bootstrapped $95\%$ confidence intervals versus the number of samples (seeds) of each $X_{i,j}$. Different colors indicate different aggregate weighting methods. The shaded regions represent standard deviations of the average confidence interval width. A total of $1,\!000$ independent trials of the evaluation procedure were executed for each sample size. Note that a confidence interval width of $10^{-1}$ is substantial because the performance ratio generally keeps the aggregate performance in the range $[0,1]$. Moreover, this width is only reached after using more than $100$ random seeds per algorithm-environment pair. Thus, many random seeds are needed to make statistically significant comparisons between all algorithms. (A sketch of this bootstrap procedure appears after the figure list.)
  • Figure 2: This plot shows the coverage probability of the bootstrapped $95\%$ confidence intervals at each sample size (number of random seeds). The shaded region represents $95\%$ confidence intervals of the coverage probability computed using the Clopper--Pearson method (Clopper & Pearson, 1934). The dotted line indicates the target failure rate of $0.05$. Confidence interval methods can only be relied upon when the failure rate is at or below the target level. In this case, the bootstrap method is only usable when at least $10$ (Adversarial-Both), $50$ (Uniform-Both), or $100$ (Adversarial-Env) samples per algorithm-environment pair are available.
  • Figure 3: This plot illustrates the aggregate performance measure and confidence intervals for each algorithm. These results used $50$ seeds for each algorithm-environment pair, and each algorithm was run on five environments. The algorithms with blue confidence intervals are those for which any one could be the best algorithm, i.e., a statistically significant difference cannot be detected. We use the number of these algorithms as the measure of uncertainty for the plots in Figure 4.
  • Figure 4: This plot shows the average number of algorithms whose confidence intervals overlap with that of the best algorithm. The error bars represent the standard deviations. The solid lines correspond to adversarial weightings and the dashed lines to uniform weightings. (Top) Each line color corresponds to a different group of environments, denoted by the number of environments. (Bottom) Each line color corresponds to a different group of algorithms, denoted by the number of algorithms. $3\ \text{(Sep)}$ and $3\ \text{(Sim)}$ correspond to algorithm sets that are well separated and similar in performance, respectively. (A sketch of this overlap-counting rule follows the figure list.)
  • Figure 5: (Left) This plot shows the return for each algorithm over the number of episodes. (Right) This plot shows the distance between the learned policy's value function and that of the $\epsilon$-greedy optimal policy. Each line represents the average value computed from $100$ trials, and the shaded regions correspond to the standard deviation. Each color corresponds to a different algorithm.
  • ...and 9 more figures
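
To make the procedure behind Figures 1 and 2 concrete, here is a minimal sketch, not the authors' code, of the core loop: percentile-bootstrap confidence intervals for a weighted aggregate of normalized scores $X_{i,j}$, with interval width and failure rate tracked as the number of seeds grows. The Beta(2, 2) score distribution, uniform weighting, trial counts, and all function names are illustrative assumptions; the paper also considers adversarial weightings.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def aggregate(perf, weights):
    """Weighted average over environments of mean per-seed performance.

    perf: (n_envs, n_seeds) array of normalized scores X_{i,j} in [0, 1].
    weights: (n_envs,) array summing to 1 (uniform here, as an assumption).
    """
    return float(weights @ perf.mean(axis=1))

def bootstrap_ci(perf, weights, n_boot=1000, alpha=0.05):
    """Percentile-bootstrap CI for the aggregate score, resampling seeds
    independently within each environment."""
    n_envs, n_seeds = perf.shape
    rows = np.arange(n_envs)[:, None]
    boot = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n_seeds, size=(n_envs, n_seeds))
        boot[b] = aggregate(perf[rows, idx], weights)
    return np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])

def clopper_pearson(k, n, alpha=0.05):
    """Exact Clopper--Pearson CI for a binomial proportion k/n."""
    lo = stats.beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = stats.beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lo, hi

n_envs = 5
weights = np.full(n_envs, 1.0 / n_envs)
true_score = 0.5    # mean of the Beta(2, 2) stand-in scores drawn below
n_trials = 100      # fewer than the paper's 1,000 trials, for speed

for n_seeds in (10, 50, 100):
    widths, misses = [], 0
    for _ in range(n_trials):
        perf = rng.beta(2.0, 2.0, size=(n_envs, n_seeds))  # stand-in data
        lo, hi = bootstrap_ci(perf, weights)
        widths.append(hi - lo)
        misses += int(not (lo <= true_score <= hi))
    cp_lo, cp_hi = clopper_pearson(misses, n_trials)  # CI on the failure rate
    print(f"{n_seeds:3d} seeds: mean CI width {np.mean(widths):.3f}, "
          f"failure rate {misses / n_trials:.3f} "
          f"(95% CP [{cp_lo:.3f}, {cp_hi:.3f}])")
```

The printout mirrors the two figures: the mean CI width corresponds to the curves in Figure 1, and the failure rate with its Clopper--Pearson interval corresponds to the coverage check in Figure 2 (a method is trustworthy only when the failure rate stays at or below $0.05$).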
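Figures 3 and 4 summarize uncertainty by counting how many algorithms cannot be statistically distinguished from the apparent best one. A hypothetical helper for that counting rule (the tuple layout and the overlap criterion are assumptions, not the paper's exact code) might look like:

```python
def could_be_best(cis):
    """cis: list of (point_estimate, lo, hi) tuples, one per algorithm.
    Counts algorithms whose confidence interval overlaps the interval of
    the algorithm with the highest point estimate, i.e., algorithms that
    cannot be ruled out as the best."""
    _, best_lo, best_hi = max(cis, key=lambda c: c[0])
    return sum(lo <= best_hi and hi >= best_lo for _, lo, hi in cis)

# Example: the first two intervals overlap, the third is clearly worse.
print(could_be_best([(0.62, 0.55, 0.69),
                     (0.58, 0.50, 0.66),
                     (0.40, 0.31, 0.49)]))  # -> 2
```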