Beyond Expected Return: Accounting for Policy Reproducibility when Evaluating Reinforcement Learning Algorithms

Manon Flageat, Bryan Lim, Antoine Cully

TL;DR

The paper addresses a gap in RL evaluation under uncertainty by formalising policy reproducibility as the dispersion of episodic returns and introducing robust metrics to measure it. It proposes the Median Absolute Deviation (MAD) to quantify reproducibility and the Lower Confidence Bound (LCB) to trade off performance against reproducibility, letting practitioners set their preference via a parameter $\alpha$. Through experiments on continuous-control tasks with multiple noise types, the authors show a tangible performance-reproducibility trade-off and demonstrate that Evolution Strategies (and especially a reproducibility-optimised variant, R-ES) yield more reproducible policies, while LCB still lets practitioners adjust the trade-off. They also extend the framework to behavioural reproducibility, using descriptor- and state-marginal-based representations, and highlight practical implications for real-world deployments where consistent behaviour is crucial.

Abstract

Many Reinforcement Learning (RL) applications have noise or stochasticity present in the environment. Beyond their impact on learning, these uncertainties lead the exact same policy to perform differently, i.e. yield different returns, from one roll-out to another. Common evaluation procedures in RL summarise the resulting return distributions using solely the expected return, which does not account for the spread of the distribution. Our work defines this spread as the policy reproducibility: the ability of a policy to obtain similar performance when rolled out many times, a crucial property in some real-world applications. We highlight that existing procedures that only use the expected return are limited on two fronts: first, an infinite number of return distributions with a wide range of performance-reproducibility trade-offs can have the same expected return, limiting its effectiveness when used for comparing policies; second, the expected return metric does not leave any room for practitioners to choose the best trade-off value for the considered application. In this work, we address these limitations by recommending the use of the Lower Confidence Bound, a metric taken from Bayesian optimisation that provides the user with a preference parameter to choose a desired performance-reproducibility trade-off. We also formalise and quantify policy reproducibility, and demonstrate the benefit of our metrics using extensive experiments with popular RL algorithms on common uncertain RL tasks.
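
To make the two limitations concrete, the sketch below compares two hypothetical policies whose sampled returns share the same mean but differ in spread, and scores them with a Lower Confidence Bound. It assumes the common form "mean minus alpha times standard deviation"; the exact estimators used in the paper are not given in this excerpt, and the function name, the return values and the choices of alpha are illustrative assumptions.

```python
from statistics import fmean, pstdev


def lower_confidence_bound(returns, alpha):
    """Assumed LCB form: mean(returns) - alpha * std(returns).

    alpha = 0 recovers the plain expected return; larger alpha
    penalises dispersion (low reproducibility) more heavily.
    """
    return fmean(returns) - alpha * pstdev(returns)


# Made-up returns over 100 roll-outs (illustrative only).
# Policy A: high return 85% of the time, zero otherwise.
policy_a = [100.0] * 85 + [0.0] * 15
# Policy B: the same, lower return on every roll-out.
policy_b = [85.0] * 100

for alpha in (0.0, 1.0, 2.0):
    lcb_a = lower_confidence_bound(policy_a, alpha)
    lcb_b = lower_confidence_bound(policy_b, alpha)
    print(f"alpha={alpha}: LCB(A)={lcb_a:.2f}  LCB(B)={lcb_b:.2f}")
```

With alpha set to 0 the two policies are indistinguishable, which is the first limitation described above; any positive alpha ranks the consistent policy higher, and the size of alpha expresses how much the practitioner values reproducibility.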

Paper Structure

This paper contains 45 sections, 1 equation, 23 figures, 8 tables.

Figures (23)

  • Figure 1: Illustration of the trade-off between policy reproducibility and performance. Policy $A$ (blue) cooks the best-breakfast-ever-made 85% of the time, but burns all the eggs the remaining 15%, while Policy $B$ (green) consistently cooks a lower-quality breakfast. On the distribution of returns (bottom), Policies $A$ and $B$ have the same expected return, highlighting the limitations of this commonly-used metric.
  • Figure 2: Illustration of the different uncertainties that can be applied within the RL evaluation setting: (1) stochastic dynamics, (2) random initialisation, (3) parameter-space noise, (4) reward noise, (5) observation noise and (6) action noise.
  • Figure 3: In non-uncertain environments (1), each policy has a fixed return; while in uncertain environments (2, 3), policies have distributions over possible returns. In homoscedastic uncertain environments (2) this distribution is the same for every policy, while in heteroscedastic uncertain environments (3), each policy can take different distribution parameters. The trade-off between policy reproducibility and performance only arises in heteroscedastic uncertain environments (3).
  • Figure 4: Illustration of the MAD and IQR metrics that quantify policy reproducibility for a given return distribution. IQR corresponds to the distance between first and third quartiles, while MAD corresponds to the median distance to the median of the distribution. As a consequence, half the sampled evaluations are closer to the median than the MAD, and half are further away, as illustrated in the figure.
  • Figure 5: MAD scores of final policies in the Ant environment. The y-axis shows the type of uncertainty present in the environment. We report the interquartile mean (IQM) and the confidence intervals (CIs) across 10 seeds. The lower the MAD score, the more reproducible the policy.
  • ...and 18 more figures
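
The two dispersion metrics described in the Figure 4 caption can be sketched in a few lines. This is a minimal illustration using only the Python standard library, following the caption's definitions (MAD as the median distance to the median, IQR as the distance between the first and third quartiles). The function names, the sample data and the "inclusive" quartile method are assumptions, not the authors' implementation.

```python
from statistics import median, quantiles


def median_absolute_deviation(returns):
    """Median distance of the sampled returns to their median."""
    centre = median(returns)
    return median(abs(r - centre) for r in returns)


def interquartile_range(returns):
    """Distance between the first and third quartiles."""
    # method="inclusive" treats the samples as the full population;
    # other quartile conventions give slightly different values.
    q1, _, q3 = quantiles(returns, n=4, method="inclusive")
    return q3 - q1


# Made-up episodic returns of one policy (illustrative only).
episode_returns = [1, 2, 3, 4, 5, 6, 7, 8, 9]

print("MAD:", median_absolute_deviation(episode_returns))
print("IQR:", interquartile_range(episode_returns))
```

Both quantities are based on order statistics, so a single extreme roll-out shifts them far less than it would shift a standard deviation.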