
Bellman Unbiasedness: Toward Provably Efficient Distributional Reinforcement Learning with General Value Function Approximation

Taehyun Cho, Seungyub Han, Seokhun Ju, Dohyeong Kim, Kyungjae Lee, Jungwoo Lee

TL;DR

This work tackles the challenge of distributional reinforcement learning (DistRL) under infinite-dimensional return distributions by introducing Bellman unbiasedness (BU) and identifying moment functionals as the provably learnable representation. It shows that, within a broad class that includes nonlinear functionals, only moments are both exactly learnable and unbiasedly estimable, whereas quantiles are neither Bellman unbiased nor Bellman closed (BC); this motivates a moment-based, finite-dimensional approach. The authors propose SF-LSVI, a Statistical Functional Least Square Value Iteration algorithm that performs online, moment-based TD updates under general value function approximation (GVFA), achieving a regret bound of $\tilde{O}(d_E H^{3/2} \sqrt{K})$ that matches or improves upon prior results under weaker assumptions. Collectively, the paper provides a principled framework for provably efficient DistRL with GVFA by compressing infinite-dimensional distributions into a finite set of estimable moments, enabling rigorous regret guarantees and avoiding detrimental discretization gaps.

Abstract

Distributional reinforcement learning improves performance by capturing environmental stochasticity, but a comprehensive theoretical understanding of its effectiveness remains elusive. In addition, the intractable element of the infinite dimensionality of distributions has been overlooked. In this paper, we present a regret analysis of distributional reinforcement learning with general value function approximation in a finite episodic Markov decision process setting. We first introduce a key notion of $\textit{Bellman unbiasedness}$ which is essential for exactly learnable and provably efficient distributional updates in an online manner. Among all types of statistical functionals for representing infinite-dimensional return distributions, our theoretical results demonstrate that only moment functionals can exactly capture the statistical information. Secondly, we propose a provably efficient algorithm, $\texttt{SF-LSVI}$, that achieves a tight regret bound of $\tilde{O}(d_E H^{\frac{3}{2}}\sqrt{K})$ where $H$ is the horizon, $K$ is the number of episodes, and $d_E$ is the eluder dimension of a function class.
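
A short worked expansion may help show why a finite set of moments can be propagated exactly. It is only a sketch under simplifying assumptions that are made here and not taken from the paper: the reward $r$ for the current state-action pair is deterministic, the return is undiscounted, and $Z'$ denotes the random return from the next state-action pair.

```latex
\mathbb{E}\big[(r + Z')^m\big]
  = \sum_{j=0}^{m} \binom{m}{j}\, r^{m-j}\, \mathbb{E}\big[(Z')^j\big]
```

Under these assumptions, the first $m$ moments at the current step are determined by $r$ and the first $m$ moments at the next step, and the expectation over the next state enters linearly, so it can be replaced by a sample average without introducing bias.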

Paper Structure

This paper contains 37 sections, 18 theorems, 72 equations, 4 figures, 2 tables, 1 algorithm.

Key Result

Theorem 3.3

Quantile functionals cannot be Bellman closed under any additional sketch.
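
A small numerical counterexample can make the intuition behind this result concrete. The sketch below is illustrative only: the distributions, weights, and helper names (`mix`, `quantile`) are hypothetical and not from the paper. It covers just the simplest case, where the median alone is tracked: two sets of next-state return distributions share identical medians and identical transition weights, yet their mixtures have different medians. The theorem's stronger claim about any additional sketch is not demonstrated here.

```python
from fractions import Fraction as F


def mix(components, weights):
    """Mixture of discrete distributions given as {value: probability} dicts."""
    out = {}
    for dist, w in zip(components, weights):
        for x, p in dist.items():
            out[x] = out.get(x, 0) + w * p
    return out


def quantile(dist, tau):
    """Lower tau-quantile: smallest value whose cumulative probability is >= tau."""
    cum = 0
    for x in sorted(dist):
        cum += dist[x]
        if cum >= tau:
            return x
    raise ValueError("probabilities do not reach tau")


half = F(1, 2)
weights = (F(3, 5), F(2, 5))          # hypothetical transition probabilities

# Two candidate return distributions for the first next state, same median.
p1 = {0: F(3, 5), 10: F(2, 5)}
p2 = {0: F(1)}
# Return distribution for the second next state.
q = {5: F(1)}

median_inputs_1 = (quantile(p1, half), quantile(q, half))
median_inputs_2 = (quantile(p2, half), quantile(q, half))

mixture_median_1 = quantile(mix((p1, q), weights), half)
mixture_median_2 = quantile(mix((p2, q), weights), half)

print("component medians:", median_inputs_1, median_inputs_2)
print("mixture medians:  ", mixture_median_1, mixture_median_2)
```

Because the inputs (component medians and weights) coincide while the outputs differ, no update rule that reads only the next-state medians can recover the median of the mixture in both cases.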

Figures (4)

  • Figure 1: Venn diagram of statistical functional classes. The diagram illustrates categories of statistical functionals. (Yellow $\cap$ Blue) Within the linear statistical functional class, Rowland et al. (2019) showed that the only functionals satisfying Bellman closedness are moment functionals. (Red $\cap$ Blue) We extend this concept by introducing the notion of Bellman unbiasedness, which covers not only moment functionals but also central moment functionals, drawn from the broader class that includes nonlinear statistical functionals. (Yellow $\cap$ $\text{Blue}^c$) According to Lemmas 3.2 and 4.4 of Rowland et al. (2019), categorical functionals are linear but not Bellman closed. (A) Maximum and minimum functionals are Bellman closed but not unbiasedly estimable. (B) Median and quantile functionals are neither Bellman closed nor unbiased, so they are not suitable for encoding the distribution exactly. The proofs corresponding to each region are provided in the appendix "Related work and Discussion".
  • Figure 2: Illustrative representation of sketch-based Bellman updates for a mixture distribution. Instead of updating the distributions directly, each sampled distribution is embedded through a sketch $\psi$ (e.g., mean $\mu$, quantile $q_i$). The transformation $\phi_\psi$ aims to compress the mixture distribution into the same number of parameters, ensuring unbiasedness to prevent information loss.
  • Figure 3: Bellman Closedness
  • Figure 4: Bellman Unbiasedness
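
The sketch-based update described for Figure 2 can be illustrated with exact arithmetic. The example below is a minimal sketch under assumptions made here, not the paper's construction: the transformation is taken to be plain averaging of per-sample sketches, and the distributions, transition probabilities, and helper names (`mixture`, `moment`, `quantile`, `expected_sketch`) are hypothetical. It compares the expected sketch of a single sampled next-state distribution with the sketch of the full mixture. The paper's definition allows a more general transformation $\phi_\psi$, which this example does not cover.

```python
from fractions import Fraction as F


def mixture(dists, probs):
    """Mixture of discrete {value: probability} distributions."""
    out = {}
    for dist, w in zip(dists, probs):
        for x, p in dist.items():
            out[x] = out.get(x, 0) + w * p
    return out


def moment(dist, k):
    """k-th raw moment of a discrete distribution."""
    return sum(p * x**k for x, p in dist.items())


def quantile(dist, tau):
    """Lower tau-quantile of a discrete distribution."""
    cum = 0
    for x in sorted(dist):
        cum += dist[x]
        if cum >= tau:
            return x
    raise ValueError("probabilities do not reach tau")


def expected_sketch(sketch, dists, probs):
    """Expectation of sketch(dist) when one next state is sampled with probs."""
    return sum(w * sketch(d) for d, w in zip(dists, probs))


# Hypothetical return distributions at three next states and their probabilities.
next_dists = [
    {0: F(1, 4), 2: F(3, 4)},
    {4: F(1)},
    {1: F(1, 4), 10: F(3, 4)},
]
probs = [F(1, 2), F(1, 4), F(1, 4)]
target = mixture(next_dists, probs)

# Moments: averaging sampled sketches matches the mixture sketch exactly.
moment_gaps = [
    expected_sketch(lambda d, k=k: moment(d, k), next_dists, probs) - moment(target, k)
    for k in (1, 2, 3)
]

# Median: averaging sampled sketches does not match the mixture median.
avg_median = expected_sketch(lambda d: quantile(d, F(1, 2)), next_dists, probs)
mix_median = quantile(target, F(1, 2))

print("moment gaps:", moment_gaps)
print("averaged median:", avg_median, "mixture median:", mix_median)
```

Exact fractions are used instead of random sampling so that the comparison reflects the expectation itself and not Monte Carlo noise.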

Theorems & Definitions (42)

  • Definition 3.1: Statistical functionals, Sketch (Bellemare et al., 2023)
  • Definition 3.2: Bellman closedness (Rowland et al., 2019)
  • Theorem 3.3
  • Definition 3.4: Bellman unbiasedness
  • Lemma 3.5
  • Theorem 3.6
  • Definition 3.8: Model Misspecification in distBC
  • Definition 5.1: $\epsilon$-dependent, $\epsilon$-independent, Eluder dimension for vector-valued function
  • Lemma 5.3: Single Step Optimization Error
  • Lemma 5.4: Confidence Region
  • ...and 32 more