A Differential Perspective on Distributional Reinforcement Learning
Juan Sebastian Rojas, Chi-Guhn Lee
TL;DR
This work extends distributional reinforcement learning from the discounted setting to the average-reward setting. It proposes a quantile-based framework that learns the limiting per-step reward distribution and relates it to the long-run average reward. The D2 family of algorithms learns the per-step reward distribution in average-reward RL, and the D3 extension also learns the differential return distribution, with theoretical convergence guarantees under standard unichain/communicating assumptions. Empirically, the methods are competitive with, and sometimes superior to, non-distributional baselines, and they provide richer information about long-run reward distributions in both toy settings and high-dimensional Atari environments. Because each distribution is parameterized by a modest number of quantiles, the approach scales well and gives practical distributional insight for continuing tasks, including applications that need to understand long-horizon risk and variability.
Abstract
To date, distributional reinforcement learning (distributional RL) methods have exclusively focused on the discounted setting, where an agent aims to optimize a discounted sum of rewards over time. In this work, we extend distributional RL to the average-reward setting, where an agent aims to optimize the reward received per time step. In particular, we utilize a quantile-based approach to develop the first set of algorithms that can successfully learn and/or optimize the long-run per-step reward distribution, as well as the differential return distribution of an average-reward MDP. We derive proven-convergent tabular algorithms for both prediction and control, as well as a broader family of algorithms that have appealing scaling properties. Empirically, we find that these algorithms yield competitive and sometimes superior performance when compared to their non-distributional equivalents, while also capturing rich information about the long-run per-step reward and differential return distributions.
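To make the quantile-based idea concrete, the following is a minimal sketch of a generic stochastic quantile-regression update applied to a stream of per-step rewards. It is an illustration under simplifying assumptions, not the D2 or D3 update rules: rewards are treated as i.i.d. samples rather than as the output of an MDP under a policy, and the names `quantile_midpoints`, `quantile_update`, and `estimate_reward_quantiles` are hypothetical. In this toy setting, the average of the quantile estimates approximates the mean reward, which plays the role of the long-run average reward.

```python
import numpy as np


def quantile_midpoints(m):
    """Evenly spaced quantile levels tau_i = (2i - 1) / (2m), for i = 1..m."""
    return (2 * np.arange(1, m + 1) - 1) / (2 * m)


def quantile_update(theta, reward, taus, step_size):
    """One stochastic quantile-regression step on a single reward sample.

    Each estimate theta[i] moves up by step_size * tau_i when the sample is
    at or above it, and down by step_size * (1 - tau_i) when it is below.
    """
    below = (reward < theta).astype(float)
    return theta + step_size * (taus - below)


def estimate_reward_quantiles(rewards, m=5, step_size=0.01):
    """Run the update over a reward stream (assumed i.i.d. for this sketch)."""
    taus = quantile_midpoints(m)
    theta = np.zeros(m)  # arbitrary initialisation
    for r in rewards:
        theta = quantile_update(theta, r, taus, step_size)
    return taus, theta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rewards = rng.uniform(0.0, 1.0, size=50_000)  # toy reward stream
    taus, theta = estimate_reward_quantiles(rewards)
    print("quantile levels:   ", taus)
    print("quantile estimates:", theta)
    print("mean of estimates: ", theta.mean())
```

With a constant step size the estimates fluctuate around the true quantiles rather than converging exactly; a decaying step size would be the usual choice when exact convergence is wanted.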
