Ergodicity in reinforcement learning

Dominik Baumann, Erfaun Noorani, Arsenii Mustafin, Xinyi Sheng, Bert Verbruggen, Arne Vanhoyweghen, Vincent Ginis, Thomas B. Schön

TL;DR

This paper discusses the impact of non-ergodic reward processes on reinforcement learning agents through an instructive example, relates the notion of ergodic reward processes to the more widely used notions of ergodic Markov chains, and presents existing solutions that optimize the long-term performance of individual trajectories under non-ergodic reward dynamics.

Abstract

In reinforcement learning, we typically aim to optimize the expected value of the sum of rewards an agent collects over a trajectory. However, if the process generating these rewards is non-ergodic, the expected value, i.e., the average over infinitely many trajectories with a given policy, is uninformative for the average over a single, but infinitely long trajectory. Thus, if we care about how the individual agent performs during deployment, the expected value is not a good optimization objective. In this paper, we discuss the impact of non-ergodic reward processes on reinforcement learning agents through an instructive example, relate the notion of ergodic reward processes to more widely used notions of ergodic Markov chains, and present existing solutions that optimize long-term performance of individual trajectories under non-ergodic reward dynamics.
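
The gap between the two averages described in the abstract can be illustrated with a small simulation. The sketch below assumes a multiplicative coin-toss game in which wealth is multiplied by 1.5 on heads and by 0.6 on tails with a fair coin; these numbers, and all function names, are illustrative assumptions rather than values stated above. Under these assumptions the expected value grows each round, while a single long trajectory shrinks.

```python
import math
import random

# Assumed (hypothetical) game parameters, not taken from the text above:
# each round multiplies wealth by 1.5 on heads and by 0.6 on tails.
HEADS_FACTOR = 1.5
TAILS_FACTOR = 0.6
P_HEADS = 0.5


def expected_factor():
    """Per-round growth factor of the average over many trajectories."""
    return P_HEADS * HEADS_FACTOR + (1 - P_HEADS) * TAILS_FACTOR


def time_average_factor():
    """Per-round growth factor along one long trajectory (geometric mean)."""
    log_growth = (P_HEADS * math.log(HEADS_FACTOR)
                  + (1 - P_HEADS) * math.log(TAILS_FACTOR))
    return math.exp(log_growth)


def simulate(num_rounds, rng):
    """Play the game once for num_rounds rounds, starting from wealth 1."""
    wealth = 1.0
    for _ in range(num_rounds):
        wealth *= HEADS_FACTOR if rng.random() < P_HEADS else TAILS_FACTOR
    return wealth


rng = random.Random(0)  # fixed seed so the run is reproducible
finals = sorted(simulate(1000, rng) for _ in range(1000))
median_final = finals[len(finals) // 2]

print("expected per-round factor:    ", expected_factor())
print("time-average per-round factor:", time_average_factor())
print("median final wealth:          ", median_final)
```

With these assumed factors, the expected per-round factor is above 1 while the time-average factor is below 1, so the typical (median) trajectory loses almost everything even though the expected value grows.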

Paper Structure (12 sections, 16 equations, 6 figures)

Figures (6)

  • Figure 1: The coin-toss example. With both the policy that analytically optimizes the expected return (left) and the RL policy (right), the agents end up with a return close to 0. The red lines mark the initial return, while the dashed blue line in the left plot shows the expected return. The other trajectories represent different realizations of the game. Both figures are taken from baumann2025reinforcement.
  • Figure 2: Possible realizations of the coin-toss example. After two iterations of the game, the agent wins on average, but only one of the four possible realizations leads to a winning outcome. Taken from baumann2025reinforcement.
  • Figure 3: Coin-toss with learned transformation. By learning on increments of transformed returns, we can learn a winning policy. The red line marks the initial return; the other colored plots represent different realizations of the game. Taken from baumann2025reinforcement.
  • Figure 4: Coin-toss with geometric mean estimation. By embedding the geometric mean estimator into the RL objective, we can learn a winning policy. The red line marks the initial return; the other colored plots represent different realizations.
  • Figure 5: Illustration of two policies for an agent in a coin toss experiment with two actions. Two different policies indicate the preference for an agent to take a safe action over a risky one, illustrated by the indifference points. An optimal policy changes the action preference based on the prediction of time growth ($p_\mathrm{T}$) rather than the expected value ($p_\mathrm{E}$).
  • ...and 1 more figure
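
The observation in the Figure 2 caption, that the agent wins on average after two rounds although only one of the four realizations is a winning one, can be reproduced by enumerating all outcomes. The sketch below is a minimal illustration and assumes a fair coin with a wealth multiplier of 1.5 on heads and 0.6 on tails; these factors and the helper `enumerate_outcomes` are hypothetical choices that are consistent with the caption, not values given in it.

```python
from itertools import product

# Assumed (hypothetical) multipliers; the caption does not state them.
HEADS_FACTOR = 1.5
TAILS_FACTOR = 0.6


def enumerate_outcomes(num_rounds, initial=1.0):
    """Map every heads/tails sequence of the given length to its final wealth."""
    outcomes = {}
    for flips in product("HT", repeat=num_rounds):
        wealth = initial
        for flip in flips:
            wealth *= HEADS_FACTOR if flip == "H" else TAILS_FACTOR
        outcomes["".join(flips)] = wealth
    return outcomes


outcomes = enumerate_outcomes(2)
# With a fair coin, all sequences are equally likely, so the expected
# wealth is the plain arithmetic mean over the enumerated outcomes.
mean_wealth = sum(outcomes.values()) / len(outcomes)
winners = [seq for seq, wealth in outcomes.items() if wealth > 1.0]

for seq, wealth in outcomes.items():
    print(seq, round(wealth, 4))
print("mean wealth after two rounds:", round(mean_wealth, 4))
print("winning sequences:", winners)
```

Under these assumptions, only the sequence with two heads ends above the initial wealth, yet that single outcome is large enough to pull the average above 1.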

Theorems & Definitions (9)

  • Definition 1: (Strong) Reward ergodicity
  • Definition 2: Asymptotic reward ergodicity
  • Remark 1
  • Definition 3: Ergodic Markov chain
  • Definition 4: Ergodic Markov reward process
  • proof
  • Remark 2
  • proof
  • Definition 5: Ergodic MDP (puterman2014markov)