Bellman Unbiasedness: Toward Provably Efficient Distributional Reinforcement Learning with General Value Function Approximation

Taehyun Cho; Seungyub Han; Seokhun Ju; Dohyeong Kim; Kyungjae Lee; Jungwoo Lee

TL;DR

This work tackles the challenge of infinite-dimensional return distributions in distributional reinforcement learning (DistRL) by introducing Bellman unbiasedness (BU) and focusing on moment functionals as the provably learnable representation. It shows that, within a broad class that includes nonlinear functionals, only moments are both exactly learnable and unbiasedly estimable; quantiles satisfy neither Bellman unbiasedness nor Bellman closedness (BC). This motivates a moment-based, finite-dimensional approach. The authors propose SF-LSVI (Statistical Functional Least Square Value Iteration), an algorithm that performs online, moment-based TD updates under general value function approximation (GVFA) and achieves a regret bound of $\tilde{O}(d_E H^{3/2} \sqrt{K})$, matching or improving upon prior results under weaker assumptions. Collectively, the paper provides a principled framework for provably efficient DistRL with GVFA: it compresses infinite-dimensional distributions into a finite set of estimable moments, which enables rigorous regret guarantees and avoids detrimental discretization gaps.

Abstract

Distributional reinforcement learning improves performance by capturing environmental stochasticity, but a comprehensive theoretical understanding of its effectiveness remains elusive. In addition, the intractable element of the infinite dimensionality of distributions has been overlooked. In this paper, we present a regret analysis of distributional reinforcement learning with general value function approximation in a finite episodic Markov decision process setting. We first introduce a key notion of $\textit{Bellman unbiasedness}$ which is essential for exactly learnable and provably efficient distributional updates in an online manner. Among all types of statistical functionals for representing infinite-dimensional return distributions, our theoretical results demonstrate that only moment functionals can exactly capture the statistical information. Secondly, we propose a provably efficient algorithm, $\texttt{SF-LSVI}$, that achieves a tight regret bound of $\tilde{O}(d_E H^{\frac{3}{2}}\sqrt{K})$ where $H$ is the horizon, $K$ is the number of episodes, and $d_E$ is the eluder dimension of a function class.

Paper Structure (37 sections, 18 theorems, 72 equations, 4 figures, 2 tables, 1 algorithm)

Infinite-dimensionality of distribution.
Online distributional update.
Related Work
Distributional RL.
RL with General Value Function Approximation.
DistRL with General Value Function Approximation.
Preliminaries
Episodic MDP.
Policy and Value Functions.
Random Variables and Distributions.
Distributional Bellman Optimality Equation.
Additional Notations.
Statistical Functionals in Distributional RL
Bellman Closedness
Bellman Unbiasedness
...and 22 more sections

Key Result

Theorem 3.3

Quantile functionals cannot be Bellman closed under any additional sketch.
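A minimal numerical sketch may help build intuition for this result. It does not prove the theorem, which rules out every additional sketch; it only illustrates, on hypothetical toy distributions, that the median of a mixture of next-state return distributions is not determined by the medians of the components, whereas the mean is. All variable names and numbers below are assumptions chosen for illustration, and `np.median` (which averages the two middle atoms) stands in for the 0.5-quantile.

```python
import numpy as np

# Hypothetical next-state return distributions, each uniform over its atoms.
q_dist = np.array([10.0, 11.0, 12.0])   # shared second component
p_case1 = np.array([0.0, 1.0, 2.0])     # median 1
p_case2 = np.array([0.0, 1.0, 100.0])   # median 1 as well, heavier tail


def mixture(a, b):
    """Equal-weight mixture of two uniform atom sets of equal size."""
    return np.concatenate([a, b])


# Both cases present the same median "sketch" (1 and 11) to a Bellman update.
same_median_sketch = np.median(p_case1) == np.median(p_case2)

# Yet the medians of the two mixtures differ, so no function of the
# component medians alone can return the mixture median in both cases.
mix_median_1 = np.median(mixture(p_case1, q_dist))  # 6.0
mix_median_2 = np.median(mixture(p_case2, q_dist))  # 10.5

# The mean, in contrast, is recovered from the component means.
mean_from_sketch_2 = 0.5 * p_case2.mean() + 0.5 * q_dist.mean()
mix_mean_2 = mixture(p_case2, q_dist).mean()

print(same_median_sketch, mix_median_1, mix_median_2)
print(np.isclose(mean_from_sketch_2, mix_mean_2))
```

The same construction works for other quantile levels by moving mass that lies strictly above or below the chosen quantile.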

Figures (4)

Figure 1: Venn diagram of statistical functional classes. The diagram illustrates the categories of statistical functionals. (Yellow $\cap$ Blue) Within the class of linear statistical functionals, Rowland et al. (2019) showed that the only functionals satisfying Bellman closedness are moment functionals. (Red $\cap$ Blue) We extend this concept by introducing the notion of Bellman unbiasedness, which covers not only moment functionals but also central moment functionals, within the broader class that includes nonlinear statistical functionals. (Yellow $\cap$ $\text{Blue}^c$) According to Lemmas 3.2 and 4.4 of Rowland et al. (2019), categorical functionals are linear but not Bellman closed. (A) Maximum and minimum functionals are Bellman closed, but they are not unbiasedly estimable. (B) Median and quantile functionals are neither Bellman closed nor unbiasedly estimable, so they are not suitable for encoding the distribution exactly. The proofs corresponding to each region are provided in the appendix "Related Work and Discussion".
Figure 2: Illustrative representation of sketch-based Bellman updates for a mixture distribution. Instead of updating the distributions directly, each sampled distribution is embedded through a sketch $\psi$ (e.g., mean $\mu$, quantile $q_i$). The transformation $\phi_\psi$ aims to compress the mixture distribution into the same number of parameters, ensuring unbiasedness to prevent information loss.
Figure 3: Bellman Closedness
Figure 4: Bellman Unbiasedness
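The sketch-based update described in the caption of Figure 2 can be illustrated for moment sketches. The following is a minimal sketch under stated assumptions, not the paper's SF-LSVI algorithm: it uses an undiscounted one-step backup $r + Z'$ (matching the finite-horizon episodic setting), a hypothetical two-state transition, and hypothetical helper names `moments` and `moment_backup`. It checks that the first two moments of the backed-up mixture are recovered from the per-state moments alone, and that the target built from one sampled next state equals the exact value in expectation over the transition.

```python
import numpy as np


def moments(atoms, probs, order=2):
    """Raw moments E[Z^1], ..., E[Z^order] of a discrete distribution."""
    atoms = np.asarray(atoms, dtype=float)
    probs = np.asarray(probs, dtype=float)
    return np.array([np.sum(probs * atoms**k) for k in range(1, order + 1)])


def moment_backup(r, next_moments):
    """Map the (first, second) moments of Z' to those of r + Z'."""
    m1, m2 = next_moments
    return np.array([r + m1, r**2 + 2.0 * r * m1 + m2])


# Hypothetical toy transition: reward r, two possible next states.
r = 1.0
p = np.array([0.3, 0.7])  # transition probabilities (assumed)
next_dists = [            # (atoms, probabilities) of the return at each next state
    (np.array([0.0, 2.0]), np.array([0.5, 0.5])),
    (np.array([-1.0, 4.0, 5.0]), np.array([0.2, 0.3, 0.5])),
]

# Ground truth: moments of the full mixture distribution of r + Z'.
mix_atoms = np.concatenate([r + a for a, _ in next_dists])
mix_probs = np.concatenate([w * q for w, (_, q) in zip(p, next_dists)])
exact = moments(mix_atoms, mix_probs)

# Sketch-based update: only the per-state moments are used.
per_state_target = np.array(
    [moment_backup(r, moments(a, q)) for a, q in next_dists]
)

# Averaging the single-sample targets over s' ~ p gives the exact moments,
# because the moment backup is linear in the transition probabilities.
expected_target = p @ per_state_target

print(exact, expected_target)
assert np.allclose(exact, expected_target)
```

The final assertion passes because raw moments of a mixture are weighted averages of the component moments; the quantile sketch in Figure 2 has no analogous transformation.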

Theorems & Definitions (42)

Definition 3.1: Statistical functionals, Sketch (Bellemare et al., 2023)
Definition 3.2: Bellman closedness (Rowland et al., 2019)
Theorem 3.3
Definition 3.4: Bellman unbiasedness
Lemma 3.5
Theorem 3.6
Definition 3.8: Model Misspecification in distBC
Definition 5.1: $\epsilon$-dependent, $\epsilon$-independent, Eluder dimension for vector-valued function
Lemma 5.3: Single Step Optimization Error
Lemma 5.4: Confidence Region
...and 32 more
