Near-Minimax-Optimal Distributional Reinforcement Learning with a Generative Model

Mark Rowland, Li Kevin Wenliang, Rémi Munos, Clare Lyle, Yunhao Tang, Will Dabney

TL;DR

The paper proposes a new algorithm for model-based distributional reinforcement learning (RL) and proves that it is minimax-optimal, up to logarithmic factors, for approximating return distributions with a generative model. This resolves an open question of Zhang et al. (2023).

Abstract

We propose a new algorithm for model-based distributional reinforcement learning (RL), and prove that it is minimax-optimal for approximating return distributions with a generative model (up to logarithmic factors), resolving an open question of Zhang et al. (2023). Our analysis provides new theoretical results on categorical approaches to distributional RL, and also introduces a new distributional Bellman equation, the stochastic categorical CDF Bellman equation, which we expect to be of independent interest. We also provide an experimental study comparing several model-based distributional RL algorithms, with several takeaways for practitioners.

Paper Structure

This paper contains 39 sections, 31 theorems, 152 equations, 10 figures, and 1 algorithm.

Key Result

Proposition 2.2

Rowland et al. (2018). The operator ${\Pi_m} \mathcal{T} : \mathscr{P}([0,(1-\gamma)^{-1}])^\mathcal{X} \rightarrow \mathscr{P}([0,(1-\gamma)^{-1}])^\mathcal{X}$ is a contraction mapping with respect to $\overline{\ell}_2$, with contraction factor $\sqrt{\gamma}$, and has a unique fixed point, $\eta_\

Figures (10)

  • Figure 1: (a) The density of a distribution $\nu$ (grey), and its categorical projection ${\Pi_m} \nu \in \mathscr{P}(\{z_1,\ldots,z_m\})$ (blue). (b) A categorical distribution (blue); its update after being scaled by $\gamma$ and shifted by $r$ by the distributional Bellman operator $\mathcal{T}$, moving its support off the grid $\{z_1,\ldots,z_m\}$ (pink); the resulting realigned distribution supported on the grid $\{z_1,\ldots,z_m\}$ after projection via ${\Pi_m}$ (green). (c) Hat functions $h_i$ (solid) and $h_m$ (dashed).
  • Figure 2: Left: Example MRP with $r(x_0) = 1, r(x_1) = 0$, $\gamma = 0.9$. Right: Categorical fixed point $F^*(x_0)$ with $m=15$, and 5 independent samples from the random CDF $\Phi^*(x_0)$.
  • Figure 3: Approximation error/wallclock time for a variety of distributional RL methods, discount factors, numbers of atoms, and numbers of environment samples.
  • Figure 4: Monte Carlo approximations of return distributions in each of the four environments tested.
  • Figure 5: The function $z \mapsto \sum_{l \leq i} h_l(z)$ (grey), and a possible configuration for $r(x) + \gamma z_j$, $r(x) + \gamma z_{j+1}$ in the event of a non-zero $H^x_{i,j} - H^x_{i,j+1}$ term.
  • ...and 5 more figures
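
The projection and update described in the Figure 1 caption can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's algorithm: it assumes an evenly spaced grid $z_1,\ldots,z_m$ on $[0,(1-\gamma)^{-1}]$, state-dependent deterministic rewards, and a known transition matrix. The function names and the transition matrix `P` are hypothetical (the Figure 2 caption gives only the rewards, $\gamma$ and $m$). Repeatedly applying ${\Pi_m}\mathcal{T}$ converges to a fixed point, in line with the contraction property stated in Proposition 2.2.

```python
import numpy as np

def categorical_projection(atoms, probs, grid):
    """Project sum_k probs[k] * delta_{atoms[k]} onto an evenly spaced grid.

    Each atom's mass is split between its two neighbouring grid points in
    proportion to proximity (hat-function weights); atoms outside the grid
    are clipped to the nearest endpoint.
    """
    grid = np.asarray(grid, dtype=float)
    probs = np.asarray(probs, dtype=float)
    m = len(grid)
    delta = grid[1] - grid[0]
    # Fractional grid index of each atom, clipped to [0, m - 1].
    pos = np.clip((np.asarray(atoms, dtype=float) - grid[0]) / delta, 0.0, m - 1.0)
    lower = np.floor(pos).astype(int)
    upper = np.minimum(lower + 1, m - 1)
    w_upper = pos - lower
    out = np.zeros(m)
    # np.add.at accumulates correctly when several atoms share a grid point.
    np.add.at(out, lower, probs * (1.0 - w_upper))
    np.add.at(out, upper, probs * w_upper)
    return out

def categorical_bellman_update(p, P, r, gamma, grid):
    """Apply Pi_m T once; p[x] holds the grid probabilities for state x."""
    new_p = np.zeros_like(p)
    for x in range(P.shape[0]):
        mixed = P[x] @ p                    # mixture over next states
        shifted = r[x] + gamma * grid       # scale by gamma, shift by r(x)
        new_p[x] = categorical_projection(shifted, mixed, grid)
    return new_p

# Rewards, discount and m follow the Figure 2 caption; P is an assumption.
gamma, m = 0.9, 15
r = np.array([1.0, 0.0])
P = np.array([[0.5, 0.5],
              [0.5, 0.5]])                  # hypothetical transition matrix
grid = np.linspace(0.0, 1.0 / (1.0 - gamma), m)

p = np.zeros((2, m))
p[:, 0] = 1.0                               # start with all mass at z_1 = 0
for _ in range(500):
    p = categorical_bellman_update(p, P, r, gamma, grid)

print(p @ grid)                             # means of the approximate return distributions
```

Because the projection splits mass by linear interpolation, it preserves the mean of any distribution supported inside the grid, so the means of the iterates follow the usual value-function recursion even though the distributions themselves are approximated.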

Theorems & Definitions (54)

  • Definition 2.1
  • Proposition 2.2
  • Proposition 4.0
  • Proposition 4.0
  • Proposition 4.0
  • Theorem 5.1
  • Lemma 5.1
  • Theorem 5.2
  • Definition 5.3
  • Definition 5.4
  • ...and 44 more