Estimation and Inference in Distributional Reinforcement Learning

Liangyu Zhang; Yang Peng; Jiadong Liang; Wenhao Yang; Zhihua Zhang

TL;DR

This work develops a statistical theory for distributional reinforcement learning by introducing a certainty-equivalence estimator $\hat{\eta}^{\pi}$ built from a generative model to recover the full return distribution $\eta^{\pi}$. It proves non-asymptotic bounds in $W_p$, KS, and TV metrics with model-based sample complexities $\widetilde{O}\left(\frac{|\mathcal{S}| |\mathcal{A}|}{\varepsilon^{2p}(1-\gamma)^{2p+2}}\right)$ for $W_p$ and $\widetilde{O}\left(\frac{|\mathcal{S}| |\mathcal{A}|}{\varepsilon^{2}(1-\gamma)^{4}}\right)$ for KS/TV, and shows a functional central limit theorem yielding Gaussian process limits for $\sqrt{n}(\hat{\eta}^{\pi}-\eta^{\pi})$ in spaces tied to $W_1$, KS, and TV. It also develops extensions to offline data and unknown reward distributions, and provides a comprehensive inferential framework for Hadamard differentiable functionals of the return distribution, enabling confidence sets and plug-in quantiles. The numerical results validate linear convergence of distributional DP, finite-sample rates, and the validity of inferential procedures, highlighting practical impact for risk-sensitive and uncertainty-aware RL. Overall, the paper unifies distributional DP with statistical inference, offering precise sample-complexity guarantees and scalable procedures for uncertainty quantification in RL.

Abstract

In this paper, we study distributional reinforcement learning from the perspective of statistical efficiency. We investigate distributional policy evaluation, which aims to estimate the complete return distribution (denoted $\eta^{\pi}$) attained by a given policy $\pi$. Assuming a generative model is available, we construct our estimator $\hat{\eta}^{\pi}$ with the certainty-equivalence method. In this setting, a dataset of size $\widetilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|}{\varepsilon^{2p}(1-\gamma)^{2p+2}}\right)$ suffices to guarantee that the $p$-Wasserstein metric between $\hat{\eta}^{\pi}$ and $\eta^{\pi}$ is less than $\varepsilon$ with high probability. This implies that the distributional policy evaluation problem can be solved sample-efficiently. We also show that, under different mild assumptions, a dataset of size $\widetilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|}{\varepsilon^{2}(1-\gamma)^{4}}\right)$ suffices to ensure that the Kolmogorov metric and the total variation metric between $\hat{\eta}^{\pi}$ and $\eta^{\pi}$ are below $\varepsilon$ with high probability. Furthermore, we investigate the asymptotic behavior of $\hat{\eta}^{\pi}$. We demonstrate that the "empirical process" $\sqrt{n}(\hat{\eta}^{\pi}-\eta^{\pi})$ converges weakly to a Gaussian process in the space of bounded functionals on the Lipschitz function class $\ell^\infty(\mathcal{F}_{\text{W}})$ and, when some mild conditions hold, also in the spaces of bounded functionals on the indicator function class $\ell^\infty(\mathcal{F}_{\text{KS}})$ and the bounded measurable function class $\ell^\infty(\mathcal{F}_{\text{TV}})$. Our findings give rise to a unified approach to statistical inference for a wide class of statistical functionals of $\eta^{\pi}$.
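The certainty-equivalence idea can be illustrated with a small sketch: estimate the transition kernel from generative-model draws, then treat the estimated kernel as if it were the truth when computing the return distribution. Everything below is an illustrative assumption rather than the paper's procedure: the two-state kernel `P`, the rewards `r`, the helper names, folding the policy into the kernel, and approximating return distributions by truncated Monte Carlo rollouts instead of distributional dynamic programming. The reported gap therefore mixes model error with Monte Carlo and truncation error.

```python
import numpy as np

def empirical_transitions(P, n, rng):
    """Estimate each row of P from n generative-model draws per state."""
    num_states = P.shape[0]
    P_hat = np.zeros_like(P, dtype=float)
    for s in range(num_states):
        next_states = rng.choice(num_states, size=n, p=P[s])
        P_hat[s] = np.bincount(next_states, minlength=num_states) / n
    return P_hat

def sample_returns(P, r, gamma, s0, horizon, m, rng):
    """Draw m truncated discounted returns from state s0 under kernel P."""
    num_states = P.shape[0]
    cdf = np.cumsum(P, axis=1)
    states = np.full(m, s0)
    returns = np.zeros(m)
    for t in range(horizon):
        returns += gamma**t * r[states]
        u = rng.random(m)
        # Inverse-CDF sampling of the next state for all m paths at once;
        # the clamp guards against floating-point rounding in the row sums.
        states = np.minimum((u[:, None] > cdf[states]).sum(axis=1),
                            num_states - 1)
    return returns

def w1_empirical(x, y):
    """W_1 between two equal-size empirical distributions (sorted coupling)."""
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))

rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1], [0.2, 0.8]])  # hypothetical kernel under a fixed policy
r = np.array([0.0, 1.0])                # hypothetical deterministic rewards in [0, 1]
gamma, horizon, m = 0.9, 200, 5000

P_hat = empirical_transitions(P, n=1000, rng=rng)
g_true = sample_returns(P, r, gamma, 0, horizon, m, rng)     # returns under P
g_hat = sample_returns(P_hat, r, gamma, 0, horizon, m, rng)  # returns under P_hat
w1_gap = w1_empirical(g_true, g_hat)
print(f"approximate W_1 gap at state 0: {w1_gap:.4f}")
```

Increasing `n` should shrink the model-error part of the gap, while `m` and `horizon` control the Monte Carlo and truncation parts.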

Paper Structure (39 sections, 56 theorems, 213 equations, 8 figures, 2 tables)

Introduction
Our Contributions
Related Works
Distributional Reinforcement Learning
Statistical Inference in Reinforcement Learning
Perturbation Theory of Markov Chains
Preliminaries
Problem Setting and the Certainty-equivalence Estimator
Metrics on the Space of Measures
Distributional Bellman Operator and Distributional Dynamic Programming
Statistical Analysis
Results on Non-asymptotic Analysis
Extension I: Less Exploratory Offline Dataset
Extension II: Unknown Reward Distributions
Implications in Risk-sensitive RL
...and 24 more sections

Key Result

Proposition 2.1

(Ross, 2011) Assume that $\mu\in\Delta(\mathbb{R})$ has finite moment and $\mu$ has a Lebesgue density $p_\mu$ that is bounded by $C$. Then for any $\nu\in\Delta(\mathbb{R})$ with finite moment, $\textup{KS}(\mu,\nu)\leq\sqrt{2CW_1(\mu,\nu)}$.
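As a sanity check on the inequality, the sketch below evaluates both sides numerically for a simple pair of distributions. The choice of $\mu=\text{Uniform}(0,1)$ (so $C=1$) and $\nu=\text{Uniform}(\delta,1+\delta)$, the grid, and the helper name `uniform_cdf` are assumptions made for illustration. It uses the one-dimensional identities $W_1(\mu,\nu)=\int|F_\mu-F_\nu|\,dx$ and $\textup{KS}(\mu,\nu)=\sup_x|F_\mu(x)-F_\nu(x)|$; for this pair both quantities equal $\delta$ when $\delta\le 1$.

```python
import numpy as np

def uniform_cdf(x, lo, hi):
    """CDF of the uniform distribution on [lo, hi], evaluated on an array."""
    return np.clip((x - lo) / (hi - lo), 0.0, 1.0)

grid = np.linspace(-1.0, 3.0, 400001)  # covers both supports with margin
dx = grid[1] - grid[0]
C = 1.0                                # density bound of mu = Uniform(0, 1)
F_mu = uniform_cdf(grid, 0.0, 1.0)

results = []
for delta in (0.01, 0.1, 0.5):
    F_nu = uniform_cdf(grid, delta, 1.0 + delta)  # nu = mu shifted by delta
    gap = np.abs(F_mu - F_nu)
    ks = float(gap.max())              # sup-norm distance between CDFs
    w1 = float(gap.sum() * dx)         # Riemann sum of the integral of |F_mu - F_nu|
    bound = float(np.sqrt(2.0 * C * w1))
    results.append((delta, ks, w1, bound))
    print(f"delta={delta}: KS={ks:.4f}, W1={w1:.4f}, sqrt(2*C*W1)={bound:.4f}")
```

For small shifts the bound $\sqrt{2\delta}$ is much larger than $\textup{KS}=\delta$, which shows that the inequality can be loose for a particular pair even though it holds in general.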

Figures (8)

Figure 1: An illustration of two types of uncertainty in RL. Blue distribution: ground-truth return distribution with quantiles $q_{0.1}$ and $q_{0.9}$. Orange distribution: estimated return distribution with quantiles $\hat{q}_{0.1}$ and $\hat{q}_{0.9}$. Shaded area (A): intrinsic uncertainty in RL. Error (B): error caused by statistical uncertainty in RL.
Figure 2: Convergence of $\log W_1(\eta^{(t)}(s_1),\hat{\eta}^\pi(s_1))$, $\log\textup{KS}(\eta^{(t)}(s_1),\hat{\eta}^\pi(s_1))$, and $\log\textup{TV}(\eta^{(t)}(s_1),\hat{\eta}^\pi(s_1))$ with sample size $n=10000$ and $\gamma=0.9$. $t$ is the iteration number.
Figure 3: Two-phase convergence of $W_1(\eta^{(t)}(s_1),\eta^\pi(s_1))$ with different sample sizes. $t$ is the iteration number. From left to right: $\gamma=0.7$; $\gamma=0.8$; $\gamma=0.9$; $\gamma=0.97$.
Figure 4: Two-phase convergence of $\textup{KS}(\eta^{(t)}(s_1),\eta^\pi(s_1))$ with different sample sizes. $t$ is the iteration number. From left to right: $\gamma=0.7$; $\gamma=0.8$; $\gamma=0.9$; $\gamma=0.97$.
Figure 5: Two-phase convergence of $\textup{TV}(\eta^{(t)}(s_1),\eta^\pi(s_1))$ with different sample sizes. $t$ is the iteration number. From left to right: $\gamma=0.7$; $\gamma=0.8$; $\gamma=0.9$; $\gamma=0.97$.
...and 3 more figures

Theorems & Definitions (105)

Remark 2.1
Proposition 2.1
Proposition 2.2
Proposition 2.3
Proposition 2.4
Proposition 2.5
Theorem 3.1
Corollary 3.1
Theorem 3.2
Theorem 3.3
...and 95 more
