Estimation and Inference in Distributional Reinforcement Learning
Liangyu Zhang, Yang Peng, Jiadong Liang, Wenhao Yang, Zhihua Zhang
TL;DR
This work develops a statistical theory for distributional reinforcement learning by introducing a certainty-equivalence estimator $\hat{\eta}^{\pi}$, built from a generative model, to recover the full return distribution $\eta^{\pi}$. It proves non-asymptotic bounds in the $p$-Wasserstein ($W_p$), Kolmogorov (KS), and total variation (TV) metrics, with model-based sample complexities $\widetilde{O}\left(\frac{|\mathcal{S}| |\mathcal{A}|}{\varepsilon^{2p}(1-\gamma)^{2p+2}}\right)$ for $W_p$ and $\widetilde{O}\left(\frac{|\mathcal{S}| |\mathcal{A}|}{\varepsilon^{2}(1-\gamma)^{4}}\right)$ for KS/TV, and establishes a functional central limit theorem yielding Gaussian process limits for $\sqrt{n}(\hat{\eta}^{\pi}-\eta^{\pi})$ in spaces tied to $W_1$, KS, and TV. It also develops extensions to offline data and unknown reward distributions, and provides an inferential framework for Hadamard-differentiable functionals of the return distribution, enabling confidence sets and plug-in quantiles. The numerical results validate linear convergence of distributional DP, finite-sample rates, and the validity of the inferential procedures, highlighting practical impact for risk-sensitive and uncertainty-aware RL. Overall, the paper unifies distributional DP with statistical inference, offering precise sample-complexity guarantees and scalable procedures for uncertainty quantification in RL.
Abstract
In this paper, we study distributional reinforcement learning from the perspective of statistical efficiency. We investigate distributional policy evaluation, which aims to estimate the complete return distribution (denoted $\eta^{\pi}$) attained by a given policy $\pi$. Assuming a generative model is available, we construct our estimator $\hat{\eta}^{\pi}$ by the certainty-equivalence method. In this setting, a dataset of size $\widetilde O\left(\frac{|\mathcal{S}||\mathcal{A}|}{\varepsilon^{2p}(1-\gamma)^{2p+2}}\right)$ suffices to guarantee that the $p$-Wasserstein metric between $\hat{\eta}^{\pi}$ and $\eta^{\pi}$ is less than $\varepsilon$ with high probability. This implies that the distributional policy evaluation problem can be solved sample-efficiently. We also show that, under different mild assumptions, a dataset of size $\widetilde O\left(\frac{|\mathcal{S}||\mathcal{A}|}{\varepsilon^{2}(1-\gamma)^{4}}\right)$ suffices to ensure that the Kolmogorov metric and the total variation metric between $\hat{\eta}^{\pi}$ and $\eta^{\pi}$ are below $\varepsilon$ with high probability. Furthermore, we investigate the asymptotic behavior of $\hat{\eta}^{\pi}$. We demonstrate that the "empirical process" $\sqrt{n}(\hat{\eta}^{\pi}-\eta^{\pi})$ converges weakly to a Gaussian process in $\ell^\infty(\mathcal{F}_{\text{W}})$, the space of bounded functionals on the Lipschitz function class, and, when some mild conditions hold, also in $\ell^\infty(\mathcal{F}_{\text{KS}})$ and $\ell^\infty(\mathcal{F}_{\text{TV}})$, the spaces of bounded functionals on the indicator function class and the bounded measurable function class, respectively. Our findings give rise to a unified approach to statistical inference for a wide class of statistical functionals of $\eta^{\pi}$.
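
To make the certainty-equivalence idea concrete, the following is a minimal sketch, not the paper's implementation. It assumes a hypothetical two-state chain in which the policy is already folded into the transition sampler and rewards are known and deterministic. It first builds an empirical transition matrix from generative-model draws, then approximates the return distribution of that estimated model. The second step swaps in truncated Monte Carlo rollouts for an exact computation, so it carries truncation and simulation error on top of the estimation error discussed above. All names (`estimate_transitions`, `sample_returns`, `TRUE_P`) are illustrative.

```python
import random
from collections import Counter


def estimate_transitions(sample_next_state, n_states, n_samples, rng):
    """Empirical transition matrix: n_samples generative-model draws per state."""
    p_hat = []
    for s in range(n_states):
        counts = Counter(sample_next_state(s, rng) for _ in range(n_samples))
        p_hat.append([counts[t] / n_samples for t in range(n_states)])
    return p_hat


def sample_returns(p_hat, rewards, gamma, start, horizon, n_rollouts, rng):
    """Monte Carlo draws of the truncated discounted return in the estimated model."""
    states = range(len(p_hat))
    returns = []
    for _ in range(n_rollouts):
        s, g, discount = start, 0.0, 1.0
        for _ in range(horizon):
            g += discount * rewards[s]   # reward collected in the current state
            discount *= gamma
            s = rng.choices(states, weights=p_hat[s])[0]  # step in the estimated model
        returns.append(g)
    return returns


# Hypothetical ground truth, used only to play the role of the generative model.
TRUE_P = [[0.9, 0.1], [0.2, 0.8]]


def true_sampler(s, rng):
    return rng.choices([0, 1], weights=TRUE_P[s])[0]


rng = random.Random(0)
p_hat = estimate_transitions(true_sampler, n_states=2, n_samples=500, rng=rng)
returns = sample_returns(p_hat, rewards=[0.0, 1.0], gamma=0.9, start=0,
                         horizon=100, n_rollouts=1000, rng=rng)
```

The empirical distribution of `returns` stands in for $\hat{\eta}^{\pi}$ at the start state. Plug-in summaries such as quantiles can be read off the sorted sample, and every draw lies in $[0, 1/(1-\gamma)]$ because the rewards lie in $[0, 1]$.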
