Policy Optimization in a Noisy Neighborhood: On Return Landscapes in Continuous Control
Nate Rahn, Pierluca D'Oro, Harley Wiltzer, Pierre-Luc Bacon, Marc G. Bellemare
TL;DR
This work investigates why deep RL for continuous control is unstable by examining the return landscape $R(\boldsymbol{\theta})$, the mapping from policy parameters to the return obtained from a fixed initial state. Taking a distributional view of the returns that follow a single parameter update, it finds noisy neighborhoods, long left tails, and failure-prone regions of policy space, and it uses left-tail probability and CVaR-based rejection of updates to steer toward smoother regions. Policies from the same training run can be connected by simple paths with no low-return valleys, whereas interpolations between policies from different runs may cross low-return regions, which points to both fragility and room for stabilization. By rejecting updates that would reduce stability, the proposed distribution-aware procedure improves robustness at modest computational cost and offers a new lens for evaluating and designing reliable continuous-control agents.
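To make the two tail statistics named above concrete, the following is a minimal sketch, not the authors' implementation. It assumes the post-update return distribution is approximated by a finite sample of returns, that left-tail probability means the fraction of samples falling below a fixed fraction of a positive reference return, and that CVaR is the mean of the worst `alpha` fraction of samples. The function names, the `drop` and `alpha` defaults, and the acceptance rule are all hypothetical choices made for illustration.

```python
import numpy as np


def left_tail_probability(returns, reference, drop=0.5):
    """Fraction of sampled returns below (1 - drop) * reference.

    Assumed definition for illustration; `reference` is taken to be a
    positive baseline return (e.g. the pre-update return).
    """
    returns = np.asarray(returns, dtype=float)
    threshold = (1.0 - drop) * reference
    return float(np.mean(returns < threshold))


def cvar(returns, alpha=0.1):
    """Lower-tail CVaR: mean of the worst ceil(alpha * n) sampled returns."""
    returns = np.sort(np.asarray(returns, dtype=float))
    k = max(1, int(np.ceil(alpha * returns.size)))
    return float(returns[:k].mean())


def accept_update(current_returns, candidate_returns, alpha=0.1):
    """Hypothetical rejection rule: keep a candidate update only if its
    lower-tail CVaR is at least that of the current policy."""
    return cvar(candidate_returns, alpha) >= cvar(current_returns, alpha)
```

Under these assumptions, a policy whose sampled post-update returns are mostly high but occasionally collapse would show a nonzero left-tail probability and a low CVaR, even when its mean return looks competitive. That gap is the kind of stability difference a mean-only evaluation would miss.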
Abstract
Deep reinforcement learning agents for continuous control are known to exhibit significant instability in their performance over time. In this work, we provide a fresh perspective on these behaviors by studying the return landscape: the mapping between a policy and a return. We find that popular algorithms traverse noisy neighborhoods of this landscape, in which a single update to the policy parameters leads to a wide range of returns. By taking a distributional view of these returns, we map the landscape, characterizing failure-prone regions of policy space and revealing a hidden dimension of policy quality. We show that the landscape exhibits surprising structure by finding simple paths in parameter space which improve the stability of a policy. To conclude, we develop a distribution-aware procedure which finds such paths, navigating away from noisy neighborhoods in order to improve the robustness of a policy. Taken together, our results provide new insight into the optimization, evaluation, and design of agents.
