On Policy Evaluation Algorithms in Distributional Reinforcement Learning

Julian Gerstenberg; Ralph Neininger; Denis Spiegel

TL;DR

This work develops universal distributional dynamic programming (DDP) methods for policy evaluation in distributional reinforcement learning that accommodate arbitrary reward distributions, including heavy-tailed and unbounded ones. It introduces a general parameterised DDP framework in which updates are combined with projections and the size of the representation may grow across iterations, gives concrete algorithms (PPA, ADP, QSP), and analyses their error rigorously via the accumulated projection error and contraction properties in Wasserstein-type metrics. The paper establishes quantitative bounds in Wasserstein and related metrics, extends them to the Kolmogorov–Smirnov distance under suitable regularity, and develops density-approximation tools with uniform guarantees. Controlled experiments demonstrate the practical advantage of the proposed black-box DDP schemes over naive Monte Carlo in representative settings, and the appendices supply the technical foundations for the metric contractions and density results. Overall, the results offer a broadly applicable toolkit for distributional policy evaluation in MDPs, including those with unbounded reward structures.

Abstract

We introduce a novel class of algorithms to efficiently approximate the unknown return distributions in policy evaluation problems from distributional reinforcement learning (DRL). The proposed distributional dynamic programming algorithms are suitable for underlying Markov decision processes (MDPs) with an arbitrary probabilistic reward mechanism, including continuous reward distributions with unbounded support that may be heavy-tailed. For a plain instance of our proposed class of algorithms we prove error bounds in both the Wasserstein and Kolmogorov–Smirnov distances. Furthermore, for return distributions that have probability density functions, the algorithms yield approximations of these densities; error bounds are given in the supremum norm. We introduce the concept of quantile-spline discretizations to obtain algorithms that show promising results in simulation experiments. While the performance of our algorithms can be analysed rigorously, they can also be seen as universal black-box algorithms applicable to a large class of MDPs. We also derive new properties of probability metrics commonly used in DRL, on which our quantitative analysis is based.
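To make the idea of distributional dynamic programming with a projection step concrete, the following is a minimal sketch and not the paper's PPA, ADP or QSP algorithm. It assumes a hypothetical single-state MDP with i.i.d. rewards, represents the return distribution by `m` equally weighted particles, and after each Bellman update projects back onto `m` particles at the quantile levels $(2i-1)/(2m)$. All function names (`midpoint_levels`, `quantile_projection`, `evaluate_single_state`) and the choice of a normal reward are illustrative assumptions.

```python
import numpy as np
from statistics import NormalDist


def midpoint_levels(m):
    """Quantile levels (2i - 1) / (2m) for i = 1, ..., m."""
    return (2 * np.arange(1, m + 1) - 1) / (2 * m)


def quantile_projection(values, m):
    """Project equally weighted particles onto m equally weighted particles
    placed at the empirical quantiles of levels (2i - 1) / (2m)."""
    v = np.sort(np.asarray(values, dtype=float))
    cdf = np.arange(1, v.size + 1) / v.size  # empirical CDF at the sorted points
    idx = np.searchsorted(cdf, midpoint_levels(m), side="left")
    return v[np.minimum(idx, v.size - 1)]


def evaluate_single_state(reward_quantile, gamma, m, n_iter):
    """Iterate G <- R + gamma * G' on particles for a single-state MDP (assumed toy setting).

    reward_quantile: quantile function of the reward distribution.
    """
    # Discretise the reward distribution by m quantile particles.
    r = np.array([reward_quantile(p) for p in midpoint_levels(m)])
    g = np.zeros(m)  # start from the point mass at 0
    for _ in range(n_iter):
        # Bellman update: all m * m combinations of reward and discounted return.
        candidates = (r[:, None] + gamma * g[None, :]).ravel()
        # Projection step: compress back to m particles.
        g = quantile_projection(candidates, m)
    return g


reward = NormalDist(mu=1.0, sigma=1.0)
particles = evaluate_single_state(reward.inv_cdf, gamma=0.7, m=100, n_iter=40)
print(particles.mean(), 1.0 / (1.0 - 0.7))  # particle mean vs. exact expected return
```

Only the quantile function of the reward is needed here, so in this sketch a heavy-tailed reward could be substituted by passing a different `reward_quantile`; the error introduced by the repeated projection is the kind of quantity the paper's accumulated projection error analysis addresses.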

Paper Structure (22 sections, 21 theorems, 87 equations, 4 figures)

Introduction
Notations
Distributional Bellman operator and dynamic programming
A general DDP Framework
Bounding the approximation error
The Accumulated Projection Error
Fixed Non-Expansive Projection
A class of DDP algorithms
Plain parameter algorithm (PPA)
Adaptive Plain Parameter Algorithm
Quantile-Spline Parameter Algorithm
Controlled Experiment
Kolmogorov–Smirnov distance and density approximation
Uniform Bounds
Sufficient criteria for existence of return densities with properties
...and 7 more sections

Key Result

Theorem 2

Under (A1) it holds that

Figures (4)

Figure 1: (Partially) overlapping curves $T\mapsto \log(e(n(T)))$ for $\gamma=0.7$ and different size functions; $M(k)=\lceil (1/\theta)^k\rceil$ with $\theta\in[0.75,0.9]$ and $M(k)\equiv m$ constant with $m\in\{50,51,\dots,2000\}$. For each choice of $M$ the curve has a single color.
Figure 2: CDFs of $\mu$ (blue) and $\Pi(\mu,\xi)$ (red) with $\xi=(x_1,\dots,x_8,y_1,\dots,y_7)\in\Xi_8$. The support of $\Pi(\mu,\xi)$ is the compact interval $[z(\xi)-w(\xi),z(\xi)+w(\xi)] = [x_1,x_8]$ and $\delta(\xi) = \max_{2\leq i\leq 8}|x_i-x_{i-1}|$.
Figure 3: Results for MDP (i), where size is the number of stored particles or, for MC estimation, the number of stored samples. Calculating the next approximation of ADP and QSP exceeds $45$ seconds.
Figure 4: Results for MDP (ii). Since $\mathrm{Cauchy}(\mu,s)\notin\mathscr{P}_1(\mathbb{R})$, the $w_1$-distances are infinite. However, $\mathrm{Cauchy}(\mu,s)\in \mathscr{P}_{1/2}(\mathbb{R}) \subsetneq \mathscr{P}_{\ell_2}(\mathbb{R})$, thus $\ell_2$-distances are finite.
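
The moment claim in the Figure 4 caption can be checked directly. The computation below assumes that $\mathscr{P}_p(\mathbb{R})$ denotes the distributions with finite $p$-th absolute moment, and that $X\sim\mathrm{Cauchy}(\mu,s)$ has location $\mu$ and scale $s>0$:

$$
\mathbb{P}(|X-\mu|>t)=\frac{2}{\pi}\arctan\!\left(\frac{s}{t}\right)\sim\frac{2s}{\pi t}\quad(t\to\infty),
\qquad
\mathbb{E}|X-\mu|^p=\int_0^\infty p\,t^{p-1}\,\mathbb{P}(|X-\mu|>t)\,\mathrm{d}t<\infty \iff 0<p<1.
$$

The integrand behaves like $t^{p-2}$ at infinity, so the first absolute moment ($p=1$) is infinite while the moment of order $p=1/2$ is finite.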

Theorems & Definitions (37)

Definition 1
Theorem 2
Remark 3: State-Action return distributions
Remark 4: Moment Assumptions
Example 1: Quantile Dynamic Programming, QDP
Theorem 5: Theorem 4.25 of bdr2023
Remark 6
Proposition 7
Lemma 8
Example 2: Analysing QDP with respect to $\mathrm{d}=w_1$
...and 27 more
