A Principled Path to Fitted Distributional Evaluation

Sungee Hong, Jiayi Wang, Zhengling Qi, Raymond K. W. Wong

TL;DR

This work develops a principled framework for extending fitted Q-evaluation to distributional off-policy evaluation (OPE), where the goal is to estimate the full return distribution of a target policy from offline data. The resulting family of methods is called fitted distributional evaluation (FDE). The framework introduces a distributional Bellman backup and a general objective that minimizes a discrepancy between the current conditional distribution and its Bellman backup, guided by contraction-inducing metrics and functional Bregman divergences. The authors prove convergence results in both tabular and non-tabular settings for a broad class of divergences (including l2-based, MMD-based, and KL-based measures) and provide concrete rates under various regularity assumptions, with near-minimax guarantees in the tabular case and general guarantees beyond it. Empirically, FDE methods outperform traditional FQE-based approaches and strong baselines on LQR and Atari tasks, supporting the framework’s theoretical claims and its practical utility for risk-aware distributional OPE in large-scale, offline RL scenarios.
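To make the idea of a distributional Bellman backup concrete, here is a minimal sketch in Python. It is not the paper's FDE procedure: it assumes a hypothetical one-state Markov reward process with a known reward distribution, and it applies the backup exactly rather than fitting it from offline data. The names `bellman_backup`, `mean`, `rewards`, and `gamma` are illustrative choices, not taken from the paper.

```python
from collections import defaultdict


def bellman_backup(dist, rewards, gamma):
    """One exact distributional Bellman backup in a one-state reward process.

    dist:    dict mapping return value -> probability (current estimate)
    rewards: dict mapping reward value -> probability (assumed known here)
    gamma:   discount factor in [0, 1)
    """
    new_dist = defaultdict(float)
    for r, p_r in rewards.items():
        for z, p_z in dist.items():
            # Push each return atom z through z -> r + gamma * z and mix over r.
            new_dist[r + gamma * z] += p_r * p_z
    return dict(new_dist)


def mean(dist):
    """Expected value of a finite discrete distribution."""
    return sum(z * p for z, p in dist.items())


# Hypothetical example: reward is 0 or 1 with equal probability.
rewards = {0.0: 0.5, 1.0: 0.5}
gamma = 0.5

dist = {0.0: 1.0}  # start from a point mass at zero
for _ in range(10):
    dist = bellman_backup(dist, rewards, gamma)
```

Each backup doubles the number of atoms, and the mean of the iterate approaches the expected return of 1. A fitted method would instead replace the exact backup with a regression-style step over a function class, using a discrepancy such as those named above.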

Abstract

In reinforcement learning, distributional off-policy evaluation (OPE) focuses on estimating the return distribution of a target policy using offline data collected under a different policy. This work focuses on extending the widely used fitted Q-evaluation -- developed for expectation-based reinforcement learning -- to the distributional OPE setting. We refer to this extension as fitted distributional evaluation (FDE). While only a few related approaches exist, there remains no unified framework for designing FDE methods. To fill this gap, we present a set of guiding principles for constructing theoretically grounded FDE methods. Building on these principles, we develop several new FDE methods with convergence analysis and provide theoretical justification for existing methods, even in non-tabular environments. Extensive experiments, including simulations on linear quadratic regulators and Atari games, demonstrate the superior performance of the FDE methods.

Paper Structure

This paper contains 91 sections, 22 theorems, 257 equations, 5 figures, 33 tables, 2 algorithms.

Key Result

Theorem 2.3

Suppose that a probability metric $\eta$ over $\mathcal{P}$ satisfies (S-L-C) with convexity parameter $q\ge 1$ and scale-sensitivity parameter $c>1/(2q)$ (see prop:metric_properties of Appendix sec:good_examples_metric). Then the expectation-extended metric $\bar{\eta}_{d_\pi, q}$ is a contraction-

Figures (5)

  • Figure 1: Mean (dots) and confidence region ($\text{mean}\pm \text{STD}$) for 5 seeds based on $N=10K$ samples: $\mathbb{W}_1$-inaccuracy (Y-axis) for each method (X-axis) for the games with $M=200$. See Figures \ref{fig:mix10_comparison}--\ref{fig:mix200_comparison} for simulation results in all seven games.
  • Figure 2: $\overline{\mathbb{W}}_{1,d_\pi,1}$-inaccuracy (Y-axis: logarithmic scale) for different sample sizes $N$ (X-axis) through 50 simulations. Shaded areas are $(\text{mean}\pm \text{STD}/\sqrt{50})$ regions for each method, with thick lines being the means.
  • Figure 3: (Mean $\pm$ STD) of $\mathbb{W}_1(\Upsilon_\theta^{d_\pi} , \Upsilon_\pi^{d_\pi})$-inaccuracy in each game (columns) for different methods with $M=10$, $N=10K$: (row1) deterministic transition with strong coverage (${\epsilon_{\rm beh}}=0.4$), (row2) deterministic transition with weak coverage (${\epsilon_{\rm beh}}>0.4$), (row3) random transition with strong coverage (${\epsilon_{\rm beh}}=0.1$), (row4) random transition with weak coverage (${\epsilon_{\rm beh}}>0.1$).
  • Figure 4: (Mean $\pm$ STD) of $\mathbb{W}_1(\Upsilon_\theta^{d_\pi} , \Upsilon_\pi^{d_\pi})$-inaccuracy in each game (columns) for different methods with $M=100$, $N=10K$: (row1) deterministic transition with strong coverage (${\epsilon_{\rm beh}}=0.4$), (row2) deterministic transition with weak coverage (${\epsilon_{\rm beh}}>0.4$), (row3) random transition with strong coverage (${\epsilon_{\rm beh}}=0.1$), (row4) random transition with weak coverage (${\epsilon_{\rm beh}}>0.1$).
  • Figure 5: (Mean $\pm$ STD) of $\mathbb{W}_1(\Upsilon_\theta^{d_\pi} , \Upsilon_\pi^{d_\pi})$-inaccuracy in each game (columns) for different methods with $M=200$, $N=10K$: (row1) deterministic transition with strong coverage (${\epsilon_{\rm beh}}=0.4$), (row2) deterministic transition with weak coverage (${\epsilon_{\rm beh}}>0.4$), (row3) random transition with strong coverage (${\epsilon_{\rm beh}}=0.1$), (row4) random transition with weak coverage (${\epsilon_{\rm beh}}>0.1$).
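
The figures above report inaccuracy in the Wasserstein-1 distance $\mathbb{W}_1$. As a rough illustration of what that quantity measures, the sketch below computes $\mathbb{W}_1$ between two one-dimensional empirical distributions with the same number of equally weighted samples, using the standard sorted-sample formula. How the paper itself estimates $\mathbb{W}_1$-inaccuracy is not stated here, so the function name `wasserstein1` and the equal-sample-size restriction are assumptions made for this example.

```python
def wasserstein1(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D empirical samples.

    For equally weighted samples of the same size, the optimal coupling
    matches the i-th smallest value of xs with the i-th smallest of ys,
    so W1 is the mean absolute difference of the sorted samples.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("samples must be non-empty and of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


# Hypothetical returns sampled from an estimate and from the target policy.
estimated_returns = [0.0, 1.0]
true_returns = [1.0, 2.0]
gap = wasserstein1(estimated_returns, true_returns)
```

Shifting every sample by a constant shifts the distance by the same amount, which is why $\mathbb{W}_1$ is sensitive to errors in the location of the return distribution as well as its shape.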

Theorems & Definitions (27)

  • Remark 2.1
  • Definition 2.2
  • Theorem 2.3
  • Theorem 2.4
  • Remark 2.5
  • Theorem 3.3
  • Theorem 3.4: Simplified version of Theorem \ref{thm:generalizd_convergence}
  • Corollary B.1
  • Corollary B.2
  • Corollary B.3
  • ...and 17 more