A Principled Path to Fitted Distributional Evaluation
Sungee Hong, Jiayi Wang, Zhengling Qi, Raymond K. W. Wong
TL;DR
This work develops a principled framework for extending fitted Q-evaluation (FQE) to distributional off-policy evaluation (OPE), where the goal is to estimate the full return distribution of a target policy from offline data. The resulting family of methods is called fitted distributional evaluation (FDE). The framework combines a distributional Bellman backup with a general objective that minimizes a discrepancy between the current conditional distribution estimate and its Bellman backup, with the choice of discrepancy guided by contraction-inducing metrics and functional Bregman divergences. The authors prove convergence in both tabular and non-tabular settings for a broad class of divergences (including L2-based, MMD-based, and KL-based measures) and derive concrete rates under various regularity assumptions, with near-minimax guarantees in the tabular case and general guarantees beyond it. Empirically, FDE methods outperform traditional FQE-based approaches and strong baselines on linear quadratic regulator (LQR) and Atari tasks, supporting the framework's theoretical claims and its practical utility for risk-aware distributional OPE in large-scale offline RL.
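To make the "fit the current estimate to its Bellman backup" idea concrete, here is a minimal tabular sketch. It is not the authors' implementation. It assumes a categorical return distribution on a fixed, evenly spaced grid (an illustrative choice of representation), a deterministic target policy, and offline transitions of the form (s, a, r, s'). All function and variable names are hypothetical. In this tabular setting, averaging the projected backup targets observed at each state-action pair minimizes both the squared L2 distance between probability vectors and the cross-entropy (forward KL) to those targets, so a single update rule covers both discrepancies.

```python
import numpy as np

def project_onto_support(atoms, probs, support):
    """Project a discrete distribution onto a fixed, evenly spaced grid by
    splitting each atom's mass between its two nearest grid points."""
    v_min, v_max = support[0], support[-1]
    delta = support[1] - support[0]
    out = np.zeros(len(support))
    pos = (np.clip(atoms, v_min, v_max) - v_min) / delta
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, len(support) - 1)
    w_hi = pos - lo  # fraction of mass sent to the upper neighbour
    np.add.at(out, lo, probs * (1.0 - w_hi))
    np.add.at(out, hi, probs * w_hi)
    return out

def fitted_distributional_evaluation(data, policy, n_states, n_actions,
                                     support, gamma, n_iters):
    """Iterate: build backup targets from the current estimate, then refit.

    data   : list of (s, a, r, s_next) offline transitions
    policy : array mapping each state to the target policy's action
    """
    n_atoms = len(support)
    # Start from a uniform distribution over the grid for every (s, a).
    eta = np.full((n_states, n_actions, n_atoms), 1.0 / n_atoms)
    for _ in range(n_iters):
        target_sum = np.zeros_like(eta)
        counts = np.zeros((n_states, n_actions))
        for s, a, r, s_next in data:
            a_next = policy[s_next]
            # Distributional Bellman backup: shift and scale the next-state
            # return distribution, then project back onto the grid.
            target = project_onto_support(r + gamma * support,
                                          eta[s_next, a_next], support)
            target_sum[s, a] += target
            counts[s, a] += 1
        # Tabular fit: the minimizer is the average target per (s, a).
        seen = counts > 0
        new_eta = eta.copy()
        new_eta[seen] = target_sum[seen] / counts[seen][:, None]
        eta = new_eta
    return eta

# Toy check: one state, one action, reward 1 on every step, gamma = 0.5,
# so the return is deterministic and equal to 1 / (1 - 0.5) = 2.
support = np.linspace(0.0, 4.0, 9)
data = [(0, 0, 1.0, 0)] * 5
policy = np.array([0])
eta = fitted_distributional_evaluation(data, policy, 1, 1, support,
                                       gamma=0.5, n_iters=100)
estimated_mean = float(eta[0, 0] @ support)
```

With function approximation, the averaging step would be replaced by minimizing the chosen discrepancy over a model class, for example by gradient descent. The sketch above covers only the tabular special case.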
Abstract
In reinforcement learning, distributional off-policy evaluation (OPE) aims to estimate the return distribution of a target policy using offline data collected under a different policy. This work extends the widely used fitted Q-evaluation, originally developed for expectation-based reinforcement learning, to the distributional OPE setting. We refer to this extension as fitted distributional evaluation (FDE). Although a few related approaches exist, there is still no unified framework for designing FDE methods. To fill this gap, we present a set of guiding principles for constructing theoretically grounded FDE methods. Building on these principles, we develop several new FDE methods with convergence analysis and provide theoretical justification for existing methods, even in non-tabular environments. Extensive experiments, including simulations on linear quadratic regulators and Atari games, demonstrate the superior performance of the FDE methods.
