Policy Optimization in a Noisy Neighborhood: On Return Landscapes in Continuous Control

Nate Rahn, Pierluca D'Oro, Harley Wiltzer, Pierre-Luc Bacon, Marc G. Bellemare

TL;DR

This work investigates why deep RL in continuous control is unstable by examining the return landscape $R(\boldsymbol{\theta})$, the mapping from policy parameters to returns from a fixed initial state. By adopting a distributional view and focusing on post-update return distributions, it uncovers noisy neighborhoods, long left tails, and failure-prone regions, and introduces left-tail probability and CVaR-based rejection to navigate toward smoother regions. It further shows that policies from the same training run can be connected by simple, valley-free paths, while cross-run interpolations may encounter low-return regions, suggesting both fragility and opportunities for stabilization. The proposed distribution-aware approach improves robustness with modest computation by selectively rejecting updates that would reduce stability, offering a new lens for evaluation and design of reliable continuous-control RL systems.

Abstract

Deep reinforcement learning agents for continuous control are known to exhibit significant instability in their performance over time. In this work, we provide a fresh perspective on these behaviors by studying the return landscape: the mapping between a policy and a return. We find that popular algorithms traverse noisy neighborhoods of this landscape, in which a single update to the policy parameters leads to a wide range of returns. By taking a distributional view of these returns, we map the landscape, characterizing failure-prone regions of policy space and revealing a hidden dimension of policy quality. We show that the landscape exhibits surprising structure by finding simple paths in parameter space which improve the stability of a policy. To conclude, we develop a distribution-aware procedure which finds such paths, navigating away from noisy neighborhoods in order to improve the robustness of a policy. Taken together, our results provide new insight into the optimization, evaluation, and design of agents.
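To make the distributional view concrete, the following is a minimal sketch of two tail statistics computed from sampled post-update returns, plus a simple rejection rule built on them. It is an illustration under stated assumptions, not the authors' implementation: the function names, the 0.5 threshold fraction, the CVaR level `alpha`, and the accept-if-CVaR-does-not-drop rule are all hypothetical choices, and the return samples are made-up numbers chosen so that both policies share the same mean.

```python
import numpy as np


def left_tail_probability(returns, threshold_fraction=0.5):
    """Fraction of sampled returns that fall below threshold_fraction * mean.

    The 0.5 default is an illustrative assumption, not a value from the paper.
    """
    returns = np.asarray(returns, dtype=float)
    return float(np.mean(returns < threshold_fraction * returns.mean()))


def cvar(returns, alpha=0.2):
    """Lower-tail CVaR: mean of the worst alpha-fraction of sampled returns."""
    returns = np.sort(np.asarray(returns, dtype=float))
    k = max(1, int(np.ceil(alpha * returns.size)))  # number of tail samples
    return float(returns[:k].mean())


def accept_update(current_returns, candidate_returns, alpha=0.2):
    """Hypothetical rule: accept a candidate policy only if its CVaR
    is at least that of the current policy."""
    return cvar(candidate_returns, alpha) >= cvar(current_returns, alpha)


# Made-up post-update return samples with identical means (100.0).
smooth = [100, 102, 98, 101, 99, 100, 103, 97, 100, 100]
noisy = [120, 125, 118, 122, 10, 121, 119, 15, 124, 126]

print(np.mean(smooth), np.mean(noisy))        # same mean return
print(left_tail_probability(smooth))          # no failures in the sample
print(left_tail_probability(noisy))           # two of ten samples collapse
print(cvar(smooth), cvar(noisy))              # tail averages differ sharply
print(accept_update(smooth, noisy))           # moving into the noisy region is rejected
```

In this toy example the mean alone cannot separate the two policies, while both tail statistics can, which is the sense in which the post-update return distribution exposes an additional dimension of policy quality.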

Paper Structure (24 sections, 5 equations, 28 figures, 2 tables, 1 algorithm)

This paper contains 24 sections, 5 equations, 28 figures, 2 tables, 1 algorithm.

Figures (28)

  • Figure 1: A visualization for two policies visited by SAC in the hopper environment. We show the return landscape in their proximity, their post-update return distributions, and the visual appearance of their learned gaits. We plot the mean of each return distribution as an orange line. Despite featuring a similar level of return, we observe that the policy in the noisy neighborhood performs an unstable curved gait which is faster but more prone to failure, as visible in the thick left tail of the post-update return distribution.
  • Algorithm 1: Post-Update-CVaR Rejection
  • Figure 2: A scatter plot showing mean return and standard deviation, skewness or left-tail probability of the post-update return distribution of policies produced by three popular deep RL algorithms on the ant Brax task. Each point corresponds to a given policy's post-update return distribution, with six selected policies highlighted by star markers showing a range of diverse distributions.
  • Figure 3: A visualization of how failures occur in the halfcheetah and walker2d tasks. The left subplots compare the reward-per-timestep obtained by a successful and failing trajectory generated by two policies in the same noisy neighborhood. The right subplots show the simultaneous evolution of returns for 10 such trajectory pairs (that can be thought of as a race to collect the most rewards), with the trajectory pair from the left indicated by a matching star marker. The right subplots indicate that policies from the same neighborhood behave similarly (diagonal segments of the curve) until the failing policy makes a sudden misstep and collects low rewards (horizontal segments).
  • Figure 4: The trajectory of a successful (top) and failing (bottom) policy, both coming from the same post-update distribution in walker2d. They exhibit a similar gait until right before the failure.
  • ...and 23 more figures

Theorems & Definitions (2)

  • Definition 2.1: Return Landscape
  • Definition 3.1: Post-Update Return