Distributional Off-policy Evaluation with Bellman Residual Minimization

Sungee Hong; Zhengling Qi; Raymond K. W. Wong

TL;DR

This work tackles distributional off-policy evaluation with offline data by replacing supremum-based distributional distances with expectation-based distances, specifically the energy distance. The authors introduce the Energy Bellman Residual Minimizer (EBRM), a Bellman-residual minimization framework that estimates the target return distribution Υ_π by minimizing the energy-distance Bellman residual over a parametric family Υ_θ, and they provide finite-sample guarantees under realizability. To address non-realizable settings, they develop a multi-step extension and practical estimators (splitting and bootstrap) with theoretical risk bounds, improving on prior methods that rely on completeness. Empirical results on OpenAI Gym tasks demonstrate strong performance and robustness to misspecification, highlighting the method’s practical value for offline distributional RL and risk-sensitive decision making.

Abstract

We study distributional off-policy evaluation (OPE), whose goal is to learn the distribution of the return for a target policy using offline data generated by a different policy. The theoretical foundation of much existing work relies on supremum-extended statistical distances, such as the supremum-Wasserstein distance, which are hard to estimate. In contrast, we study the more manageable expectation-extended statistical distances and provide a novel theoretical justification of their validity for learning the return distribution. Based on this attractive property, we propose a new method called Energy Bellman Residual Minimizer (EBRM) for distributional OPE. We provide corresponding in-depth theoretical analyses. We establish a finite-sample error bound for the EBRM estimator under the realizability assumption. Furthermore, we introduce a variant of our method based on a multi-step extension, which improves the error bound for non-realizable settings. Notably, unlike prior distributional OPE methods, the theoretical guarantees of our method do not require the completeness assumption.
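To make the energy distance mentioned above concrete, here is a minimal sketch of a plug-in (V-statistic) estimate between two one-dimensional samples, together with a toy one-step Bellman target of the form r + γ·z'. It assumes scalar returns and uses the standard definition 2E|X−Y| − E|X−X'| − E|Y−Y'|. The function names `energy_distance` and `bellman_target_samples` are hypothetical, and the paper's actual EBRM objective and estimator may differ from this illustration.

```python
import numpy as np


def energy_distance(x, y):
    """Plug-in estimate of 2E|X-Y| - E|X-X'| - E|Y-Y'| from two 1-D samples.

    All pairs (including i == j) are averaged, so the estimate is exactly
    zero when both samples are identical.
    """
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    cross = np.abs(x[:, None] - y[None, :]).mean()      # E|X - Y|
    within_x = np.abs(x[:, None] - x[None, :]).mean()   # E|X - X'|
    within_y = np.abs(y[:, None] - y[None, :]).mean()   # E|Y - Y'|
    return 2.0 * cross - within_x - within_y


def bellman_target_samples(rewards, next_returns, gamma):
    """Toy one-step target samples r + gamma * z' (illustrative assumption)."""
    rewards = np.asarray(rewards, dtype=float)
    next_returns = np.asarray(next_returns, dtype=float)
    return rewards + gamma * next_returns


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    model_returns = rng.normal(loc=1.0, scale=1.0, size=500)
    targets = bellman_target_samples(
        rewards=rng.normal(loc=0.1, scale=0.1, size=500),
        next_returns=rng.normal(loc=1.0, scale=1.0, size=500),
        gamma=0.9,
    )
    # A residual-style discrepancy between model samples and target samples.
    print(energy_distance(model_returns, targets))
```

In this toy setup, a smaller value indicates that the model's return samples are closer in distribution to the one-step targets. An expectation-extended residual would average such discrepancies over state-action pairs drawn from the data distribution instead of taking a supremum over them.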

Paper Structure (99 sections, 14 theorems, 265 equations, 10 figures, 25 tables, 3 algorithms)

INTRODUCTION
Expectation-extension For Distributional OPE
Statistical Error Bound
Relaxing Completeness
Summary
OFF-POLICY EVALUATION BASED ON BELLMAN EQUATION
Background
Existing Measures Of Bellman Residuals
Expectation-extended Distance
ENERGY BELLMAN RESIDUAL MINIMIZER
Estimated Bellman Residual
Statistical Error Bound
NON-REALIZABLE SETTINGS
Combating Non-realizability With Multi-step Extensions
Splitting-based Estimator
...and 84 more sections

Key Result

Theorem 2.2

Under Assumption RN_derivative, if the statistical distance $\eta$ satisfies translation invariance, scale sensitivity, convexity, and the relaxed triangle inequality defined in Appendix properties_of_distance, then we can bound the expectation-based inaccuracy: for any $\Upsilon$ […], where $B(\gamma)$ does not depend on $\Upsilon$ and $B(\gamma)<\infty$ for all $0\leq \gamma<1$. […]

Figures (10)

Figure 1: $\mathcal{E}$-Inaccuracy for three different settings of Cartpole games based on small neural network model. Lines represent mean inaccuracy values and shaded regions represent the interval $(\text{Mean}\pm 2 \cdot \text{STD})$ (blue: EBRM, orange: QRDQN, green: MMDQN).
Figure 2: Larger $m$ makes $(\mathcal{T}^\pi)^m\Upsilon_\theta\approx \Upsilon_{\pi}$ in expected energy distance, and thereby leads to $\theta_*^{(m)}\approx \tilde{\theta}$.
Figure 3: At the top, red line represents the selected minimizer of each $F_m$ by R function optimize, which may be a non-global local minimizer, whereas the green line represents the best $\bar{\mathcal{E}}$-approximation. The bottom plots represent the corresponding pdf contour plots.
Figure 4: Histogram of Monte-Carlo approximated return distribution (marginalized) of Settings 1, 2, 3. Simulated with 100000 samples.
Figure 5: $\mathcal{E}$ and $\mathbb{W}_1$-inaccuracy based on small and big neural network models. Lines represent mean inaccuracy values and shaded regions represent the interval $(\text{Mean}\pm 2 \cdot \text{STD})$ (blue: EBRM, orange: QRDQN, green: MMDQN).
...and 5 more figures

Theorems & Definitions (20)

Theorem 2.2
Theorem 3.3
Remark 3.4
Remark 3.5
Remark 4.2
Theorem 4.4
Remark A.1
Lemma A.2
Lemma A.3
Corollary A.4
...and 10 more
