Colored Noise in PPO: Improved Exploration and Performance through Correlated Action Sampling

Jakob Hollenstein, Georg Martius, Justus Piater

TL;DR

This paper investigates whether introducing temporally colored action noise into PPO can enhance exploration and learning in an on-policy setting. By parameterizing noise with a color factor $β$ in a Gaussian reparameterization, the authors show that an intermediate color ($β=0.5$) often yields the best average performance across a broad suite of environments, with performance improving as the update dataset size grows. They also demonstrate that four parallel collection environments (yielding roughly 8192 samples per update under their settings) provide a favorable trade-off between exploration and data efficiency, and that larger update sizes interact with noise color in a way that tends to favor more correlated noise. Overall, the work recommends adopting colored noise with $β=0.5$ as a default in PPO to boost exploration and performance, with some evidence of transfer potential to other on-policy methods.
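
As a reading aid, the Gaussian reparameterization with a color factor can be written out as below. This is a sketch under the standard definition of power-law (colored) noise, not necessarily the paper's exact notation: the symbols $\mu_\theta$, $\sigma_\theta$, $\varepsilon_t$ and $S_\varepsilon$ are assumptions introduced here for illustration.

$$
a_t = \mu_\theta(s_t) + \sigma_\theta(s_t) \odot \varepsilon_t,
\qquad
S_\varepsilon(f) \propto \frac{1}{f^{\beta}}
$$

Here $\varepsilon_{1:T}$ is a noise sequence that is Gaussian at each time step but correlated across time steps, and $S_\varepsilon(f)$ is its power spectral density. With $β=0$ the spectrum is flat and the samples are uncorrelated (white noise, the PPO default); $β=1$ corresponds to pink noise; $β=0.5$ lies between the two.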

Abstract

Proximal Policy Optimization (PPO), a popular on-policy deep reinforcement learning method, employs a stochastic policy for exploration. In this paper, we propose a colored noise-based stochastic policy variant of PPO. Previous research highlighted the importance of temporal correlation in action noise for effective exploration in off-policy reinforcement learning. Building on this, we investigate whether correlated noise can also enhance exploration in on-policy methods like PPO. We discovered that correlated noise for action selection improves learning performance and outperforms the currently popular uncorrelated white noise approach in on-policy methods. Unlike off-policy learning, where pink noise was found to be highly effective, we found that a colored noise, intermediate between white and pink, performed best for on-policy learning in PPO. We examined the impact of varying the amount of data collected for each update by modifying the number of parallel simulation environments for data collection and observed that with a larger number of parallel environments, more strongly correlated noise is beneficial. Due to the significant impact and ease of implementation, we recommend switching to correlated noise as the default noise source in PPO.

Paper Structure (28 sections, 2 equations, 17 figures, 6 tables, 1 algorithm)

Figures (17)

  • Figure 1: Two-dimensional random walks caused by colored noise of different $\beta$. Lower $\beta$ values put more energy into the high-frequency part of the power spectral density, so the random walk changes direction more frequently and explores more locally and less globally. Higher values of $\beta$ put more energy into the lower frequencies of the power spectral density. This translates to random walks that change direction less frequently and thus explore more globally.
  • Figure 2: Benchmark environments: (top) Mountain Car, Cartpole Balance, Cartpole Swingup, Ball in Cup (Catch), (middle) Hopper Hop, Cheetah Run, Walker Run, Reacher Hard, Pendulum Swingup, (bottom) Door, UMaze Point, UMaze Ant, 4 Rooms Point, UMaze Swimmer, 4 Rooms Swimmer, 4 Rooms Ant
  • Figure 3: Performance averaged across environments: Correlated noise $\beta = 0.5$ significantly outperforms the default white noise ($\beta=0$, Sec. \ref{sec:beta0vanilla}) used by PPO. The bars indicate the $95\%$ bootstrapped confidence intervals.
  • Figure 4: Performance averaged across environments and noise colors: the number of parallel data collection environments has a significant impact on the performance. Bootstrapped $95\%$ confidence intervals for the mean are shown. $N_\textrm{env}=4$ achieves the highest performance, though it does not significantly outperform $N_\textrm{env}=2$.
  • Figure 5: Preferred noise color depends on the number of environments: average performance across environments, showing the impact of noise color $\beta$ combined with the number of parallel environments $\textrm{n-envs}$. (\ref{subfig:marker_size_average_performance}) A trend is visible: the averaged performance is larger for larger $\beta$ when more collection environments are used, but the decline due to the number of environments outweighs this trend. (\ref{subfig:marker_size_rank_performance}) Ranks of average performance are indicated by circle size; ranks are calculated across noise colors but within the same number of environments. The positive trend between number of environments and larger $\beta$ is clearly visible.
  • ...and 12 more figures
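
The qualitative behavior described in the Figure 1 caption can be reproduced with a short sketch. The code below shapes white Gaussian noise in the frequency domain so that its power spectral density falls off roughly as $1/f^{\beta}$, then measures how often a two-dimensional walk reverses direction. It is an illustration under stated assumptions, not the authors' implementation: the function names (`colored_noise`, `reversal_fraction`, `walk_extent`), the simple FFT filter, the removal of the zero-frequency term, and the per-dimension normalization are all choices made here.

```python
import numpy as np

def colored_noise(beta, n_steps, n_dims, rng):
    """Noise of shape (n_steps, n_dims) whose PSD falls off roughly as 1/f^beta.

    Assumption: a plain FFT filter applied to white noise; the paper may
    use a different generator.
    """
    freqs = np.fft.rfftfreq(n_steps)           # non-negative frequencies
    gain = np.zeros_like(freqs)                # gain[0] = 0 removes the DC term
    gain[1:] = freqs[1:] ** (-beta / 2.0)      # amplitude ~ f^(-beta/2)
    white = rng.standard_normal((n_dims, n_steps))
    spectrum = np.fft.rfft(white, axis=-1) * gain
    shaped = np.fft.irfft(spectrum, n=n_steps, axis=-1)
    shaped /= shaped.std(axis=-1, keepdims=True)  # unit variance per dimension
    return shaped.T

def reversal_fraction(steps):
    """Fraction of consecutive step pairs that point in opposing directions."""
    dots = np.sum(steps[1:] * steps[:-1], axis=-1)
    return float(np.mean(dots < 0.0))

def walk_extent(steps):
    """Largest distance from the start reached by the cumulative walk."""
    path = np.cumsum(steps, axis=0)
    return float(np.max(np.linalg.norm(path, axis=-1)))

rng = np.random.default_rng(0)
for beta in (0.0, 0.5, 1.0):
    steps = colored_noise(beta, n_steps=4096, n_dims=2, rng=rng)
    print(f"beta={beta}: reversals={reversal_fraction(steps):.2f}, "
          f"extent={walk_extent(steps):.1f}")
```

For white noise ($\beta=0$) consecutive steps are independent, so about half of the step pairs oppose each other. Larger $\beta$ should lower that fraction, which matches the caption's description of walks that change direction less often. Because the zero-frequency term is removed, each walk returns to its starting point at the end, so `walk_extent` reports the farthest point reached rather than the final displacement.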