A Differential Perspective on Distributional Reinforcement Learning

Juan Sebastian Rojas, Chi-Guhn Lee

TL;DR

This work extends distributional reinforcement learning from the discounted to the average-reward setting by proposing a quantile-based framework that learns the limiting per-step reward distribution, whose mean is the long-run average reward. It introduces the D2 family of algorithms to learn the per-step reward distribution in average-reward RL and the D3 extension to also learn the differential return distribution, with theoretical convergence guarantees under standard unichain/communicating assumptions. Empirically, the methods yield competitive and sometimes superior performance compared to non-distributional baselines and provide richer information about long-run reward distributions in both toy settings and high-dimensional Atari environments. The approach scales by parameterizing each distribution with a modest number of quantiles, which makes distributional estimates practical for continuing tasks where long-run risk and reward variability matter.

Abstract

To date, distributional reinforcement learning (distributional RL) methods have exclusively focused on the discounted setting, where an agent aims to optimize a discounted sum of rewards over time. In this work, we extend distributional RL to the average-reward setting, where an agent aims to optimize the reward received per time step. In particular, we utilize a quantile-based approach to develop the first set of algorithms that can successfully learn and/or optimize the long-run per-step reward distribution, as well as the differential return distribution of an average-reward MDP. We derive proven-convergent tabular algorithms for both prediction and control, as well as a broader family of algorithms that have appealing scaling properties. Empirically, we find that these algorithms yield competitive and sometimes superior performance when compared to their non-distributional equivalents, while also capturing rich information about the long-run per-step reward and differential return distributions.
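
For orientation, the quantities named in the abstract can be written out with the standard textbook definitions from average-reward RL. The notation is an assumption chosen to match the symbols $\bar{r}_{\pi}$ and $\phi_{\pi}$ used in the figure captions, and may differ from the paper's own. Here $m$ is a hypothetical number of quantiles and $\theta_i$ a hypothetical estimate of the quantile at level $\tau_i = \frac{2i-1}{2m}$, following the usual quantile parameterization in distributional RL.

$$
\bar{r}_{\pi} = \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \mathbb{E}_{\pi}\left[ R_t \right],
\qquad
G_t = \sum_{k=1}^{\infty} \left( R_{t+k} - \bar{r}_{\pi} \right),
\qquad
\phi_{\pi} \approx \frac{1}{m} \sum_{i=1}^{m} \delta_{\theta_i}
$$

The first expression is the long-run average reward, the second is the differential return whose distribution the abstract refers to, and the third is a quantile approximation of the limiting per-step reward distribution.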

Paper Structure

This paper contains 27 sections, 19 theorems, 92 equations, 12 figures, 3 tables, 13 algorithms.

Key Result

Proposition 4.1

The limiting per-step reward distribution is the natural distributional objective in the average-reward setting, given that its mean yields the long-run average-reward, which is the primary prediction and control objective of (non-distributional) average-reward RL.
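
The link between the per-step reward distribution and the average reward can be illustrated with a minimal sketch. This is not the paper's D2 algorithm: it ignores states and actions, and it treats a hypothetical i.i.d. reward stream as a stand-in for rewards drawn from the limiting distribution $\phi_{\pi}$ under a fixed policy. It applies a plain quantile-regression update and then compares the mean of the learned quantiles with the empirical average reward. The function name, step size, and reward model are all assumptions made for illustration.

```python
import numpy as np


def learn_reward_quantiles(rewards, num_quantiles=20, step_size=0.01):
    """Estimate quantiles of a reward stream with stochastic quantile regression."""
    # Quantile midpoints tau_i = (2i - 1) / (2m), for i = 1..m.
    taus = (2 * np.arange(1, num_quantiles + 1) - 1) / (2 * num_quantiles)
    theta = np.zeros(num_quantiles)
    for r in rewards:
        # Move theta_i up by step_size * tau_i when r >= theta_i,
        # and down by step_size * (1 - tau_i) when r < theta_i.
        theta += step_size * (taus - (r < theta).astype(float))
    return taus, theta


rng = np.random.default_rng(0)
n = 50_000

# Hypothetical bimodal reward stream (an assumption, not data from the paper).
from_high_mode = rng.random(n) < 0.3
rewards = np.where(from_high_mode, rng.normal(4.0, 0.5, n), rng.normal(1.0, 0.5, n))

taus, theta = learn_reward_quantiles(rewards)

# The mean of the quantile estimates approximates the average reward,
# while the individual quantiles also describe the shape of the distribution.
avg_from_quantiles = theta.mean()
empirical_avg = rewards.mean()

print("average reward from quantiles:", round(float(avg_from_quantiles), 3))
print("empirical average reward:     ", round(float(empirical_avg), 3))
print("lowest / highest quantile:    ", round(float(theta[0]), 3), round(float(theta[-1]), 3))
```

In this sketch the two averages should roughly agree, and the spread between the lowest and highest quantile shows the bimodal structure that a single average-reward estimate would hide.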

Figures (12)

  • Figure 1: Illustration of the agent-environment interaction in an average-reward MDP. As $t \to \infty$, following policy $\pi$ yields a limiting per-step reward distribution, $\phi_{\pi}$, with an average-reward, $\bar{r}_{\pi}$. Standard average-reward RL methods aim to learn and/or optimize the average-reward, $\bar{r}_{\pi}$. By contrast, the differential distributional RL methods proposed in this work aim to learn and/or optimize the limiting per-step reward distribution, $\phi_{\pi}$.
  • Figure 2: a) Histogram showing the empirical ($\varepsilon$-greedy) optimal (long-run) per-step reward distribution in the red-pill blue-pill task. b) Quantiles of the optimal per-step reward distribution in the red-pill blue-pill task. c) Convergence plot of the per-step reward quantile estimates as learning progresses when using the D2 Q-learning algorithm in the red-pill blue-pill task.
  • Figure 3: Rolling average-reward when using the D2 and D3 algorithms vs. a non-distributional Differential algorithm in the red-pill blue-pill environment. A solid line denotes the mean average-reward, and the corresponding shaded region denotes a 95% confidence interval over 50 runs.
  • Figure 4: Rolling averages of the total reward per episode when using the D2 and D3 algorithms vs. non-distributional Differential algorithms in the a) Breakout, b) BeamRider, and c) Freeway Atari 2600 environments. A solid line denotes the mean total reward per episode, and the corresponding shaded region denotes a 95% confidence interval over 8 runs.
  • Figure C.1: An illustration of the a) red-pill blue-pill, and b) inverted pendulum environments.
  • ...and 7 more figures

Theorems & Definitions (38)

  • Definition 3.1: Rowland2024-sg
  • Definition 3.2: Bellemare2023-mn
  • Proposition 4.1
  • Lemma 4.2
  • proof
  • Theorem 4.3
  • proof
  • Theorem 4.4
  • proof
  • Definition B.1.3: Differential Inclusion
  • ...and 28 more