Near-Minimax-Optimal Distributional Reinforcement Learning with a Generative Model

Mark Rowland; Li Kevin Wenliang; Rémi Munos; Clare Lyle; Yunhao Tang; Will Dabney

TL;DR

The paper proposes a new algorithm for model-based distributional reinforcement learning (RL) and proves that it is minimax-optimal, up to logarithmic factors, for approximating return distributions with a generative model. This resolves an open question of Zhang et al. (2023).

Abstract

We propose a new algorithm for model-based distributional reinforcement learning (RL), and prove that it is minimax-optimal for approximating return distributions with a generative model (up to logarithmic factors), resolving an open question of Zhang et al. (2023). Our analysis provides new theoretical results on categorical approaches to distributional RL, and also introduces a new distributional Bellman equation, the stochastic categorical CDF Bellman equation, which we expect to be of independent interest. We also provide an experimental study comparing several model-based distributional RL algorithms, with several takeaways for practitioners.

Paper Structure (39 sections, 31 theorems, 152 equations, 10 figures, 1 algorithm)

Introduction
Background
Reinforcement learning with a generative model
Distributional reinforcement learning
Categorical dynamic programming
Distributional reinforcement learning with a generative model
Direct categorical fixed-point computation
Direct categorical fixed-point computation
DCFP with a generative model
Sample complexity analysis
Structure of the proof of Theorem \ref{thm:wasserstein}
The stochastic categorical CDF Bellman equation
Empirical evaluation
Conclusion
Related work
...and 24 more sections

Key Result

Proposition 2.2

(Rowland et al., 2018). The operator ${\Pi_m} \mathcal{T} : \mathscr{P}([0,(1-\gamma)^{-1}])^\mathcal{X} \rightarrow \mathscr{P}([0,(1-\gamma)^{-1}])^\mathcal{X}$ is a contraction mapping with respect to $\overline{\ell}_2$, with contraction factor $\sqrt{\gamma}$, and has a unique fixed point ...
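
The contraction property in Proposition 2.2 can be checked numerically. The following is a minimal Python sketch, not the paper's implementation: it repeatedly applies ${\Pi_m} \mathcal{T}$ to categorical distributions on a two-state MRP and measures the supremum Cramér distance $\overline{\ell}_2$ between iterates. The rewards, discount factor, and $m = 15$ follow the Figure 2 caption; the transition matrix `P`, the equally spaced grid, and all function names are assumptions made for illustration.

```python
import math

def grid_points(m, gamma):
    """Equally spaced atoms z_1..z_m on [0, 1/(1-gamma)] (assumed grid)."""
    top = 1.0 / (1.0 - gamma)
    return [top * i / (m - 1) for i in range(m)]

def project_atom(z, grid):
    """Split a point mass at z between its two neighbouring grid atoms."""
    m = len(grid)
    step = grid[1] - grid[0]
    pos = min(max((z - grid[0]) / step, 0.0), m - 1.0)  # clamp to the grid
    lo = min(int(pos), m - 2)
    w_hi = pos - lo
    return {lo: 1.0 - w_hi, lo + 1: w_hi}

def cdp_step(eta, P, r, gamma, grid):
    """One application of Pi_m T; eta[x] is a list of m probabilities."""
    n, m = len(P), len(grid)
    new = [[0.0] * m for _ in range(n)]
    for x in range(n):
        for y in range(n):
            for j in range(m):
                mass = P[x][y] * eta[y][j]
                if mass == 0.0:
                    continue
                # Scale atom z_j by gamma, shift by r(x), then project.
                for i, w in project_atom(r[x] + gamma * grid[j], grid).items():
                    new[x][i] += mass * w
    return new

def cramer_sup(eta_a, eta_b, grid):
    """Supremum over states of the Cramer distance between grid CDFs."""
    step = grid[1] - grid[0]
    worst = 0.0
    for pa, pb in zip(eta_a, eta_b):
        ca = cb = sq = 0.0
        for a, b in zip(pa[:-1], pb[:-1]):
            ca += a
            cb += b
            sq += step * (ca - cb) ** 2
        worst = max(worst, math.sqrt(sq))
    return worst

gamma, m = 0.9, 15
r = [1.0, 0.0]                      # r(x_0) = 1, r(x_1) = 0, as in Figure 2
P = [[0.5, 0.5], [0.2, 0.8]]        # hypothetical transition matrix
grid = grid_points(m, gamma)

eta = [[1.0 / m] * m for _ in range(2)]   # arbitrary starting point
for _ in range(200):
    eta = cdp_step(eta, P, r, gamma, grid)
gap = cramer_sup(eta, cdp_step(eta, P, r, gamma, grid), grid)
print("distance between successive iterates:", gap)
```

Because each step shrinks $\overline{\ell}_2$ by at least a factor of $\sqrt{\gamma}$, the gap between successive iterates should decay geometrically, and the loop approaches the categorical fixed point.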

Figures (10)

Figure 1: (a) The density of a distribution $\nu$ (grey), and its categorical projection ${\Pi_m} \nu \in \mathscr{P}(\{z_1,\ldots,z_m\})$ (blue). (b) A categorical distribution (blue); its update after being scaled by $\gamma$ and shifted by $r$ by the distributional Bellman operator $\mathcal{T}$, moving its support off the grid $\{z_1,\ldots,z_m\}$ (pink); the resulting realigned distribution supported on the grid $\{z_1,\ldots,z_m\}$ after projection via ${\Pi_m}$ (green). (c) Hat functions $h_i$ (solid) and $h_m$ (dashed).
Figure 2: Left: Example MRP with $r(x_0) = 1, r(x_1) = 0$, $\gamma = 0.9$. Right: Categorical fixed point $F^*(x_0)$ with $m=15$, and 5 independent samples from the random CDF $\Phi^*(x_0)$.
Figure 3: Approximation error/wallclock time for a variety of distributional RL methods, discount factors, numbers of atoms, and numbers of environment samples.
Figure 4: Monte Carlo approximations of return distributions in each of the four environments tested.
Figure 5: The function $z \mapsto \sum_{l \leq i} h_l(z)$ (grey), and a possible configuration for $r(x) + \gamma z_j$, $r(x) + \gamma z_{j+1}$ in the event of a non-zero $H^x_{i,j} - H^x_{i,j+1}$ term.
...and 5 more figures
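
The categorical projection and Bellman update described in the Figure 1 caption can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the grid $z_1,\ldots,z_m$ is taken to be equally spaced on $[0,(1-\gamma)^{-1}]$, the input distribution is finitely supported, and the names `make_grid`, `hat`, and `project` are hypothetical rather than taken from the paper.

```python
def make_grid(m, gamma):
    """Equally spaced atoms z_1..z_m on [0, 1/(1-gamma)] (assumed grid)."""
    top = 1.0 / (1.0 - gamma)
    return [top * i / (m - 1) for i in range(m)]

def hat(i, z, grid):
    """Hat function h_i: equals 1 at grid[i] and falls linearly to 0 at the
    neighbouring atoms; the two edge hats stay at 1 outside the grid."""
    m = len(grid)
    if i == 0 and z <= grid[0]:
        return 1.0
    if i == m - 1 and z >= grid[-1]:
        return 1.0
    if i > 0 and grid[i - 1] <= z <= grid[i]:
        return (z - grid[i - 1]) / (grid[i] - grid[i - 1])
    if i < m - 1 and grid[i] <= z <= grid[i + 1]:
        return (grid[i + 1] - z) / (grid[i + 1] - grid[i])
    return 0.0

def project(atoms, probs, grid):
    """Categorical projection: the mass at grid[i] is the expectation of h_i."""
    out = [0.0] * len(grid)
    for z, p in zip(atoms, probs):
        for i in range(len(grid)):
            out[i] += p * hat(i, z, grid)
    return out

# Panel (b): start from a categorical distribution on the grid, scale its
# support by gamma and shift by a reward r (moving it off the grid), then
# project back onto the grid.
gamma, reward = 0.5, 1.0
grid = make_grid(5, gamma)
probs = [0.2] * 5
shifted_atoms = [reward + gamma * z for z in grid]
realigned = project(shifted_atoms, probs, grid)
print([round(p, 3) for p in realigned])
```

Since the hat functions sum to one at every point, the projected probabilities still sum to one, and for atoms inside the grid range the projection preserves the mean.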

Theorems & Definitions (54)

Definition 2.1
Proposition 2.2
Proposition 4.0
Proposition 4.0
Proposition 4.0
Theorem 5.1
Lemma 5.1
Theorem 5.2
Definition 5.3
Definition 5.4
...and 44 more
