Long N-step Surrogate Stage Reward to Reduce Variances of Deep Reinforcement Learning in Complex Problems

Junmin Zhong, Ruofan Wu, Jennie Si

TL;DR

This study introduces a long $N$-step surrogate stage (LNSS) reward approach that accounts for complex environment dynamics, whereas previous multi-step methods are usually feasible only for a limited number of steps. It also provides analytical insights into how LNSS exponentially reduces the upper bound on the variance of the Q value relative to the corresponding single-step method.

Abstract

High variances in reinforcement learning have been shown to impede successful convergence and hurt task performance. As the reward signal plays an important role in learning behavior, multi-step methods have been considered to mitigate the problem and are believed to be more effective than single-step methods. However, there is a lack of comprehensive and systematic study demonstrating the effectiveness of multi-step methods in solving highly complex continuous control problems. In this study, we introduce a new long $N$-step surrogate stage (LNSS) reward approach to effectively account for complex environment dynamics, whereas previous methods are usually feasible only for a limited number of steps. The LNSS method is simple, has low computational cost, and is applicable to value-based or policy gradient reinforcement learning. We systematically evaluate LNSS in OpenAI Gym and DeepMind Control Suite on complex benchmark environments where it has generally been challenging for DRL to obtain good results. We demonstrate that LNSS improves performance in terms of total reward, convergence speed, and coefficient of variation (CV). We also provide analytical insights on how LNSS exponentially reduces the upper bound on the variances of the Q value relative to a respective single-step method.
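The coefficient of variation named in the abstract is the standard deviation of a set of values divided by the magnitude of their mean, so a lower CV indicates more consistent results across trials. The sketch below is only an illustration of that statistic: the function name, the choice of population (rather than sample) standard deviation, and the return values are assumptions made here, not details taken from the paper.

```python
import statistics


def coefficient_of_variation(returns):
    """Return std / |mean| for a sequence of episode returns.

    Assumption: population standard deviation is used; the paper may
    use the sample standard deviation instead.
    """
    mean = statistics.fmean(returns)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.pstdev(returns) / abs(mean)


# Hypothetical episode returns from five evaluation trials (made-up numbers).
stable_returns = [100.0, 100.0, 100.0, 100.0, 100.0]
noisy_returns = [50.0, 150.0, 50.0, 150.0, 100.0]

cv_stable = coefficient_of_variation(stable_returns)
cv_noisy = coefficient_of_variation(noisy_returns)
print(f"stable CV = {cv_stable:.4f}, noisy CV = {cv_noisy:.4f}")
```

Both hypothetical runs have the same mean return, so the total reward alone would not separate them; the CV does, which is why it is a useful complement to total reward when comparing training stability.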

Paper Structure (18 sections, 34 equations, 8 figures, 2 tables, 1 algorithm)

Figures (8)

  • Figure 1: Variance discount factor $\psi$ in Equation (\ref{Eq:psi}).
  • Figure 2: Systematic evaluation of LNSS using several challenging continuous control tasks in OpenAI Gym and DMC. The shaded regions represent half a standard deviation of the average evaluation over 5 trials. The x-axis of the plots is the number of steps.
  • Figure 3: Performance comparison between LNSS (N=100) and the mean reward method (n=100). All 6 tasks are considered for the two algorithms. Specifically, for each algorithm and each task, episode rewards are normalized to $[0,500]$. Then the episode rewards of all six tasks are plotted together in one color per algorithm. The shaded regions represent half a standard deviation of the average evaluation over 5 episodes. The x-axis of the plots is the number of steps. Detailed results for individual tasks are shown in Appendix \ref{appendix:vs mean reward}.
  • Figure 4: Q value standard deviation (Std) percentage of all tested algorithms in the DMC humanoid walk task. The x-axis is the number of steps. For detailed results of other DMC tasks, refer to Figure \ref{fig:variance full} in Appendix \ref{appendix:variance}.
  • Figure 5: Episode rewards for the 3 tasks in DMC by LNSS with $n=1$ but different $N$ ($N = 5, 50, 100$) in Equation (\ref{Eq:r'}). The episode rewards for each task are normalized to $[0, 500]$. The shaded regions represent half a standard deviation of the average evaluation scores over 5 trials. The x-axis is the number of steps. Additional details are provided in Appendix \ref{appendix:N step}.
  • ...and 3 more figures