Estimation and Inference in Distributional Reinforcement Learning

Liangyu Zhang, Yang Peng, Jiadong Liang, Wenhao Yang, Zhihua Zhang

TL;DR

This work develops a statistical theory for distributional reinforcement learning by introducing a certainty-equivalence estimator $\hat{\eta}^{\pi}$ built from a generative model to recover the full return distribution $\eta^{\pi}$. It proves non-asymptotic bounds in $W_p$, KS, and TV metrics with model-based sample complexities $\widetilde{O}\left(\frac{|\mathcal{S}| |\mathcal{A}|}{\varepsilon^{2p}(1-\gamma)^{2p+2}}\right)$ for $W_p$ and $\widetilde{O}\left(\frac{|\mathcal{S}| |\mathcal{A}|}{\varepsilon^{2}(1-\gamma)^{4}}\right)$ for KS/TV, and shows a functional central limit theorem yielding Gaussian process limits for $\sqrt{n}(\hat{\eta}^{\pi}-\eta^{\pi})$ in spaces tied to $W_1$, KS, and TV. It also develops extensions to offline data and unknown reward distributions, and provides a comprehensive inferential framework for Hadamard differentiable functionals of the return distribution, enabling confidence sets and plug-in quantiles. The numerical results validate linear convergence of distributional DP, finite-sample rates, and the validity of inferential procedures, highlighting practical impact for risk-sensitive and uncertainty-aware RL. Overall, the paper unifies distributional DP with statistical inference, offering precise sample-complexity guarantees and scalable procedures for uncertainty quantification in RL.

Abstract

In this paper, we study distributional reinforcement learning from the perspective of statistical efficiency. We investigate distributional policy evaluation, aiming to estimate the complete return distribution (denoted $\eta^{\pi}$) attained by a given policy $\pi$. We use the certainty-equivalence method to construct our estimator $\hat{\eta}^{\pi}$, assuming a generative model is available. In this setting, we need a dataset of size $\widetilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|}{\varepsilon^{2p}(1-\gamma)^{2p+2}}\right)$ to guarantee that the $p$-Wasserstein metric between $\hat{\eta}^{\pi}$ and $\eta^{\pi}$ is less than $\varepsilon$ with high probability. This implies that the distributional policy evaluation problem can be solved sample-efficiently. We also show that, under different mild assumptions, a dataset of size $\widetilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|}{\varepsilon^{2}(1-\gamma)^{4}}\right)$ suffices to ensure that the Kolmogorov metric and the total variation metric between $\hat{\eta}^{\pi}$ and $\eta^{\pi}$ are below $\varepsilon$ with high probability. Furthermore, we investigate the asymptotic behavior of $\hat{\eta}^{\pi}$. We demonstrate that the "empirical process" $\sqrt{n}(\hat{\eta}^{\pi}-\eta^{\pi})$ converges weakly to a Gaussian process in the space of bounded functionals on the Lipschitz function class, $\ell^\infty(\mathcal{F}_{\text{W}})$, and, when some mild conditions hold, also in the spaces of bounded functionals on the indicator function class, $\ell^\infty(\mathcal{F}_{\text{KS}})$, and on the bounded measurable function class, $\ell^\infty(\mathcal{F}_{\text{TV}})$. Our findings give rise to a unified approach to statistical inference of a wide class of statistical functionals of $\eta^{\pi}$.
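To get a feel for how the two sample-size expressions in the abstract scale, the following minimal sketch evaluates only their leading-order terms. The function names are hypothetical, and the constants and logarithmic factors hidden by $\widetilde{O}(\cdot)$ are dropped, so the outputs indicate orders of magnitude rather than actual sample sizes prescribed by the paper.

```python
def wp_sample_size(num_states, num_actions, eps, gamma, p=1):
    """Leading-order term |S||A| / (eps^(2p) * (1-gamma)^(2p+2)).

    Hypothetical helper: constants and log factors are ignored.
    """
    return num_states * num_actions / (eps ** (2 * p) * (1 - gamma) ** (2 * p + 2))


def ks_tv_sample_size(num_states, num_actions, eps, gamma):
    """Leading-order term |S||A| / (eps^2 * (1-gamma)^4) for the KS and TV metrics.

    Hypothetical helper: constants and log factors are ignored.
    """
    return num_states * num_actions / (eps ** 2 * (1 - gamma) ** 4)


if __name__ == "__main__":
    # Illustrative problem size; these numbers are not taken from the paper.
    S, A, eps = 10, 2, 0.1
    for gamma in (0.7, 0.9, 0.97):
        print(
            f"gamma={gamma}: "
            f"W_1 ~ {wp_sample_size(S, A, eps, gamma, p=1):.3g}, "
            f"W_2 ~ {wp_sample_size(S, A, eps, gamma, p=2):.3g}, "
            f"KS/TV ~ {ks_tv_sample_size(S, A, eps, gamma):.3g}"
        )
```

For $p=1$ the Wasserstein expression reduces to $\frac{|\mathcal{S}||\mathcal{A}|}{\varepsilon^{2}(1-\gamma)^{4}}$, the same leading-order term as for the KS and TV metrics, while larger $p$ makes the dependence on both $\varepsilon$ and $1-\gamma$ steeper.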

Paper Structure (39 sections, 56 theorems, 213 equations, 8 figures, 2 tables)

This paper contains 39 sections, 56 theorems, 213 equations, 8 figures, 2 tables.

Key Result

Proposition 2.1

(Ross, 2011) Assume that $\mu\in\Delta({\mathbb R})$ has finite moment and $\mu$ has a Lebesgue density $p_\mu$ that is bounded by $C$. Then for any $\nu\in\Delta({\mathbb R})$ with finite moment, $\textup{KS}(\mu,\nu)\leq\sqrt{2CW_1(\mu,\nu)}$.
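As a sanity check, the inequality can be verified numerically in a case where both sides have closed forms. The sketch below is an illustration chosen here, not an example from the paper: it takes $\mu = N(0,1)$, whose density is bounded by $C = 1/\sqrt{2\pi}$, and $\nu = N(\delta,1)$. For a pure shift, $W_1(\mu,\nu)=|\delta|$ and $\textup{KS}(\mu,\nu)=2\Phi(|\delta|/2)-1$, where $\Phi$ is the standard normal CDF. All function names are hypothetical.

```python
import math


def std_normal_cdf(x):
    """Standard normal CDF expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def ks_shifted_normals(delta):
    """KS distance between N(0,1) and N(delta,1).

    The gap between the two CDFs is largest at x = delta / 2,
    where it equals 2 * Phi(|delta| / 2) - 1.
    """
    return 2.0 * std_normal_cdf(abs(delta) / 2.0) - 1.0


def ks_upper_bound(density_bound, w1):
    """Right-hand side of Proposition 2.1: sqrt(2 * C * W_1)."""
    return math.sqrt(2.0 * density_bound * w1)


# Sup of the N(0,1) density, attained at x = 0.
C = 1.0 / math.sqrt(2.0 * math.pi)

for delta in (0.01, 0.1, 0.5, 1.0, 3.0):
    w1 = abs(delta)  # W_1 between a distribution and its translate is the shift size
    ks = ks_shifted_normals(delta)
    bound = ks_upper_bound(C, w1)
    assert ks <= bound
    print(f"delta={delta}: KS={ks:.4f} <= sqrt(2*C*W_1)={bound:.4f}")
```

In this example the bound is loose for small shifts, since KS shrinks linearly in $\delta$ while the bound shrinks like $\sqrt{\delta}$. This is consistent with the proposition being a general-purpose conversion from a $W_1$ guarantee to a KS guarantee.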

Figures (8)

  • Figure 1: An illustration of two types of uncertainty in RL. Blue distribution: ground-truth return distribution with quantiles $q_{0.1}$ and $q_{0.9}$. Orange distribution: estimated return distribution with quantiles $\hat{q}_{0.1}$ and $\hat{q}_{0.9}$. Shaded area (A): intrinsic uncertainty in RL. Error (B): error caused by statistical uncertainty in RL.
  • Figure 2: Convergence of $\log W_1(\eta^{(t)}(s_1),\hat{\eta}^\pi(s_1))$, $\log\textup{KS}(\eta^{(t)}(s_1),\hat{\eta}^\pi(s_1))$, and $\log\textup{TV}(\eta^{(t)}(s_1),\hat{\eta}^\pi(s_1))$ with sample size $n=10000$ and $\gamma=0.9$. $t$ is the iteration number.
  • Figure 3: Two-phase convergence of $W_1(\eta^{(t)}(s_1),\eta^\pi(s_1))$ with different sample sizes. $t$ is the iteration number. From left to right: $\gamma=0.7$; $\gamma=0.8$; $\gamma=0.9$; $\gamma=0.97$.
  • Figure 4: Two-phase convergence of $\textup{KS}(\eta^{(t)}(s_1),\eta^\pi(s_1))$ with different sample sizes. $t$ is the iteration number. From left to right: $\gamma=0.7$; $\gamma=0.8$; $\gamma=0.9$; $\gamma=0.97$.
  • Figure 5: Two-phase convergence of $\textup{TV}(\eta^{(t)}(s_1),\eta^\pi(s_1))$ with different sample sizes. $t$ is the iteration number. From left to right: $\gamma=0.7$; $\gamma=0.8$; $\gamma=0.9$; $\gamma=0.97$.
  • ...and 3 more figures
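
The figure captions above track three discrepancies between return distributions: $W_1$, KS, and TV. As a minimal sketch, assuming both distributions are represented as probability vectors over one shared, sorted grid of return values (a simplification made here; the paper's own representation and experimental code are not shown on this page), the three quantities can be computed as follows. The function name and the toy inputs are hypothetical.

```python
from itertools import accumulate


def discrete_metrics(support, p, q):
    """Return (W_1, KS, TV) for two distributions on a shared sorted support.

    support: increasing list of atoms; p, q: probability vectors over those atoms.
    """
    cdf_p = list(accumulate(p))
    cdf_q = list(accumulate(q))
    gaps = [abs(a - b) for a, b in zip(cdf_p, cdf_q)]

    # W_1 on the real line is the integral of |F - G|; both CDFs are step
    # functions, so the integral is a sum over consecutive atoms.
    w1 = sum(gaps[i] * (support[i + 1] - support[i]) for i in range(len(support) - 1))

    # KS is the largest gap between the two CDFs.
    ks = max(gaps)

    # TV between discrete distributions is half the L1 distance of the masses.
    tv = 0.5 * sum(abs(a - b) for a, b in zip(p, q))
    return w1, ks, tv


# Toy example: q is p shifted one atom to the right.
support = [0.0, 1.0, 2.0]
p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]
print(discrete_metrics(support, p, q))  # (1.0, 0.5, 0.5)
```

Restricting both distributions to a common grid keeps the sketch short; comparing distributions with different atoms would first require merging their supports.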

Theorems & Definitions (105)

  • Remark 2.1
  • Proposition 2.1
  • Proposition 2.2
  • Proposition 2.3
  • Proposition 2.4
  • Proposition 2.5
  • Theorem 3.1
  • Corollary 3.1
  • Theorem 3.2
  • Theorem 3.3
  • ...and 95 more