Distributional Off-policy Evaluation with Bellman Residual Minimization
Sungee Hong, Zhengling Qi, Raymond K. W. Wong
TL;DR
This work tackles distributional off-policy evaluation with offline data by replacing supremum-based distributional distances with expectation-based distances, specifically leveraging energy distance. The authors introduce Energy Bellman Residual Minimizer (EBRM), a Bellman-residual minimization framework that estimates the target return distribution Υ_π by optimizing over a parametric family Υ_θ using the energy-distance Bellman residual, and provide finite-sample guarantees under realizability. To address non-realizable settings, they develop a multi-step extension and practical estimators (splitting and bootstrap) with theoretical risk bounds, showing improvements over prior methods that rely on completeness. Empirical results on OpenAI Gym tasks demonstrate strong performance and robustness to misspecification, highlighting the method’s practical value for offline, distributional RL and risk-sensitive decision making.
Abstract
We study distributional off-policy evaluation (OPE), whose goal is to learn the distribution of the return for a target policy using offline data generated by a different policy. The theoretical foundation of many existing works relies on supremum-extended statistical distances, such as the supremum-Wasserstein distance, which are hard to estimate. In contrast, we study the more manageable expectation-extended statistical distances and provide a novel theoretical justification for their validity in learning the return distribution. Based on this attractive property, we propose a new method called Energy Bellman Residual Minimizer (EBRM) for distributional OPE, and we provide corresponding in-depth theoretical analyses. We establish a finite-sample error bound for the EBRM estimator under the realizability assumption. Furthermore, we introduce a variant of our method based on a multi-step extension, which improves the error bound for non-realizable settings. Notably, unlike prior distributional OPE methods, the theoretical guarantees of our method do not require the completeness assumption.
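
As a rough illustration of the kind of expectation-based distance EBRM builds on, the sketch below computes the standard sample (V-statistic) form of the energy distance between two sets of scalar returns, 2E|X - Y| - E|X - X'| - E|Y - Y'|. It is not the authors' estimator: the Bellman operator, the averaging over state-action pairs, and the parametric family Υ_θ are all omitted, and the names `mean_abs_diff` and `energy_distance`, as well as the Gaussian toy samples, are assumptions made for this example.

```python
import random


def mean_abs_diff(a, b):
    """Average of |u - v| over all pairs (u, v) with u in a and v in b."""
    return sum(abs(u - v) for u in a for v in b) / (len(a) * len(b))


def energy_distance(x, y):
    """Plug-in (V-statistic) estimate of the energy distance between two samples.

    Estimates 2 E|X - Y| - E|X - X'| - E|Y - Y'| from the empirical
    distributions of x and y. The estimate is zero when both samples coincide.
    """
    return 2.0 * mean_abs_diff(x, y) - mean_abs_diff(x, x) - mean_abs_diff(y, y)


# Toy returns (hypothetical data): two samples from the same distribution
# and one sample whose mean is shifted by 2.
rng = random.Random(0)
base = [rng.gauss(0.0, 1.0) for _ in range(200)]
close = [rng.gauss(0.0, 1.0) for _ in range(200)]
shifted = [rng.gauss(2.0, 1.0) for _ in range(200)]

d_close = energy_distance(base, close)      # small: same underlying distribution
d_shifted = energy_distance(base, shifted)  # larger: the means differ
```

Because the quantity is an average over sample pairs rather than a supremum over states, it can be estimated directly from data, which is the practical appeal of expectation-extended distances described above.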
