Reinforcement learning with non-ergodic reward increments: robustness via ergodicity transformations
Dominik Baumann, Erfaun Noorani, James Price, Ole Peters, Colm Connaughton, Thomas B. Schön
TL;DR
This paper addresses the lack of robustness in RL that arises from non-ergodic reward dynamics. It introduces a data-driven ergodicity transformation under which return increments behave ergodically, so that standard RL methods can optimize the long-term performance of single trajectories. The core idea is to learn a variance-stabilizing transform $h(R)$ whose increments $\Delta h(R)$ behave approximately like those of a Brownian motion with drift, so that the time-averaged growth of one trajectory matches the ensemble expectation. The authors demonstrate the approach on a coin-toss example with a heavy-tailed return distribution and on standard benchmarks, showing that transformed rewards yield more robust policies and better long-term growth (for example, through a learned or logarithmic transform, with connections to the Kelly criterion). They also connect the framework to risk-sensitive RL, analyze exponential transforms, and present a proof of concept using ergodic REINFORCE on cart-pole and reacher, which shows improved robustness and generalization. Limitations and open questions include the restriction to settings with incremental return updates, the possible need for state-dependent transforms, and extensions to multi-agent problems; future work directions include learning dynamics and the interaction with discounting.
Abstract
Envisioned application areas for reinforcement learning (RL) include autonomous driving, precision agriculture, and finance, which all require RL agents to make decisions in the real world. A significant challenge hindering the adoption of RL methods in these domains is the non-robustness of conventional algorithms. In particular, the focus of RL is typically on the expected value of the return. The expected value is the average over the statistical ensemble of infinitely many trajectories, which can be uninformative about the performance of the average individual. For instance, when we have a heavy-tailed return distribution, the ensemble average can be dominated by rare extreme events. Consequently, optimizing the expected value can lead to policies that yield exceptionally high returns with a probability that approaches zero but almost surely result in catastrophic outcomes in single long trajectories. In this paper, we develop an algorithm that lets RL agents optimize the long-term performance of individual trajectories. The algorithm enables the agents to learn robust policies, which we show in an instructive example with a heavy-tailed return distribution and on standard RL benchmarks. The key element of the algorithm is a transformation that we learn from data. This transformation turns the time series of collected returns into one whose increments have an expected value that coincides with their average over a long trajectory. Optimizing these increments results in robust policies.
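The gap between the ensemble average and the long-run behavior of a single trajectory can be illustrated with a multiplicative coin toss. The following is a minimal sketch, not the authors' implementation: the payoff multipliers (1.5 for heads, 0.6 for tails), the fair coin, and all function names are assumptions chosen for illustration, and the logarithm is used here as a hand-picked transform rather than one learned from data.

```python
import math
import random

# Assumed illustrative payoffs (not taken from the paper): heads multiplies
# the return by 1.5, tails multiplies it by 0.6, each with probability 0.5.
UP, DOWN, P_HEADS = 1.5, 0.6, 0.5


def ensemble_growth_factor():
    """Expected per-round multiplier, i.e. the average over the ensemble."""
    return P_HEADS * UP + (1 - P_HEADS) * DOWN


def time_average_log_growth():
    """Expected increment of log(R) per round.

    For multiplicative dynamics the increments of log(R) are i.i.d., so this
    expectation equals the long-run growth rate of a single trajectory.
    """
    return P_HEADS * math.log(UP) + (1 - P_HEADS) * math.log(DOWN)


def simulate_log_growth(n_rounds, seed=0):
    """Average log-increment observed along one simulated trajectory."""
    rng = random.Random(seed)
    log_r = 0.0
    for _ in range(n_rounds):
        log_r += math.log(UP if rng.random() < P_HEADS else DOWN)
    return log_r / n_rounds


def expected_log_growth(fraction):
    """Expected log-increment when only `fraction` of the return is staked.

    The staking fraction is a hypothetical policy parameter added here to
    show the link to the Kelly criterion.
    """
    win = 1 + fraction * (UP - 1)
    lose = 1 + fraction * (DOWN - 1)
    return P_HEADS * math.log(win) + (1 - P_HEADS) * math.log(lose)


if __name__ == "__main__":
    print("ensemble growth factor per round:", ensemble_growth_factor())
    print("expected log-increment per round:", time_average_log_growth())
    print("simulated log-increment (one trajectory):", simulate_log_growth(100_000))
    print("expected log-increment at fraction 0.25:", expected_log_growth(0.25))
```

Under these assumed payoffs the ensemble average grows by a factor of 1.05 per round, while the expected log-increment is negative (about -0.053), so a single long trajectory almost surely decays. Staking only a quarter of the return each round, which is the Kelly fraction for these particular numbers, makes the expected log-increment positive.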
