Bellman Unbiasedness: Toward Provably Efficient Distributional Reinforcement Learning with General Value Function Approximation
Taehyun Cho, Seungyub Han, Seokhun Ju, Dohyeong Kim, Kyungjae Lee, Jungwoo Lee
TL;DR
This work addresses the infinite dimensionality of return distributions in distributional reinforcement learning (DistRL) by introducing Bellman unbiasedness (BU) and identifying moment functionals as the provably learnable representation. It shows that, within a broad class of statistical functionals that includes nonlinear ones, only moments are both exactly learnable and unbiasedly estimable, whereas quantiles are neither BU nor Bellman closed (BC); this motivates a moment-based, finite-dimensional approach. The authors propose SF-LSVI, a Statistical Functional Least Square Value Iteration algorithm that performs online, moment-based TD updates under general value function approximation (GVFA). It achieves a regret bound of $\tilde{O}(d_E H^{3/2} \sqrt{K})$, matching or improving upon prior results under weaker assumptions. Overall, the paper provides a principled framework for provably efficient DistRL with GVFA: it compresses infinite-dimensional distributions into a finite set of estimable moments, which enables rigorous regret guarantees and avoids detrimental discretization gaps.
Abstract
Distributional reinforcement learning improves performance by capturing environmental stochasticity, but a comprehensive theoretical understanding of its effectiveness remains elusive. In addition, the intractability arising from the infinite dimensionality of return distributions has been overlooked. In this paper, we present a regret analysis of distributional reinforcement learning with general value function approximation in a finite episodic Markov decision process setting. First, we introduce the key notion of $\textit{Bellman unbiasedness}$, which is essential for exactly learnable and provably efficient distributional updates in an online manner. Among all types of statistical functionals for representing infinite-dimensional return distributions, our theoretical results demonstrate that only moment functionals can exactly capture the statistical information. Second, we propose a provably efficient algorithm, $\texttt{SF-LSVI}$, that achieves a tight regret bound of $\tilde{O}(d_E H^{\frac{3}{2}}\sqrt{K})$, where $H$ is the horizon, $K$ is the number of episodes, and $d_E$ is the eluder dimension of a function class.
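The contrast between moments and quantiles can be made concrete with a small numerical sketch. The example below is a simplified illustration of the intuition, not the paper's formal definition of Bellman unbiasedness: it only checks that a moment of a mixture of next-state return distributions equals the average of the per-state moments, while a quantile (here the median) does not share this property. The two return distributions, their equal weighting, and all function and variable names are hypothetical choices made for illustration.

```python
import numpy as np


def second_moment(samples):
    """Second raw moment E[Z^2] of an empirical distribution."""
    return float(np.mean(np.square(samples)))


def median(samples):
    """0.5-quantile of an empirical distribution."""
    return float(np.median(samples))


# Hypothetical return distributions at two equally likely next states,
# each represented by four equally weighted atoms.
returns_a = np.array([0.0, 0.0, 0.0, 10.0])
returns_b = np.array([4.0, 4.0, 4.0, 4.0])

# The true target is the 50/50 mixture of the two distributions.
# Concatenation gives that mixture because both arrays have equal size.
mixture = np.concatenate([returns_a, returns_b])

# Moment: averaging the per-state statistics recovers the mixture statistic.
avg_m2 = 0.5 * second_moment(returns_a) + 0.5 * second_moment(returns_b)
mix_m2 = second_moment(mixture)

# Quantile: averaging the per-state medians does not recover the mixture median.
avg_median = 0.5 * median(returns_a) + 0.5 * median(returns_b)
mix_median = median(mixture)

print("second moment:", avg_m2, "vs mixture", mix_m2)
print("median:       ", avg_median, "vs mixture", mix_median)
```

In this toy setting, an update that observes a single sampled next state and regresses onto its second moment targets the right quantity in expectation, whereas the same procedure applied to the median is biased. The sketch covers only this linearity property and omits rewards, discounting, and function approximation.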
