Model Selection for Off-policy Evaluation: New Algorithms and Experimental Protocol

Pai Liu, Lingfeng Zhao, Shivangi Agarwal, Jinghan Liu, Audrey Huang, Philip Amortila, Nan Jiang

TL;DR

This work develops new model-free and model-based selectors with theoretical guarantees, together with a new experimental protocol for evaluating them empirically. Exemplifying the protocol on Gym-Hopper, it finds that the new model-free selector, LSTD-Tournament, demonstrates promising empirical performance.

Abstract

Holdout validation and hyperparameter tuning from data are a long-standing problem in offline reinforcement learning (RL). A standard framework is to use off-policy evaluation (OPE) methods to evaluate and select the policies, but OPE either incurs exponential variance (e.g., importance sampling) or has hyperparameters of its own (e.g., FQE and model-based methods). We focus on hyperparameter tuning for OPE itself, which is even more under-investigated. Concretely, we select among candidate value functions ("model-free") or dynamics models ("model-based") to best assess the performance of a target policy. We develop: (1) new model-free and model-based selectors with theoretical guarantees, and (2) a new experimental protocol for empirically evaluating them. Compared to the model-free protocol in prior works, our new protocol allows for more stable generation and better control of candidate value functions in an optimization-free manner, and for the evaluation of model-free and model-based methods alike. We exemplify the protocol on Gym-Hopper, and find that our new model-free selector, LSTD-Tournament, demonstrates promising empirical performance.

Paper Structure

This paper contains 64 sections, 7 theorems, 50 equations, 11 figures, 1 table.

Key Result

Theorem 1

Let $\Theta \subset \mathbb{R}^d$ be a set of parameters such that $\theta^\star \in \Theta$. Assume $\max_{s,a} \|\phi(s,a)\|_2 \leq B_\phi$ and $\max_{\theta \in \Theta} \|\theta\|_2 \leq 1$. Let $\widehat{\theta} := \mathop{\mathrm{arg\,min}}\limits_{\theta\in\Theta} \|\widehat{A} \theta - \widehat{\cdots}$ [...] where $\sigma_{\min}(\cdot)$ is the smallest singular value.
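
The theorem statement above is cut off, so the following is only a minimal sketch of textbook LSTD for $Q^\pi$, not the paper's estimator. It assumes the standard definitions $\widehat{A} = \frac{1}{n}\sum \phi(s,a)\,(\phi(s,a) - \gamma\,\phi(s',\pi(s')))^\top$ and $\widehat{b} = \frac{1}{n}\sum \phi(s,a)\,r$, and it solves $\widehat{A}\theta = \widehat{b}$ directly, which is the unconstrained special case of the arg-min when $\widehat{A}$ is invertible. The two-state MDP, the one-hot features, and all function names are hypothetical and chosen only so the result can be checked by hand.

```python
# Minimal LSTD sketch (pure standard library). All names and the toy MDP
# below are illustrative assumptions, not taken from the paper.

def lstd_system(transitions, phi, policy, gamma, dim):
    """Build the empirical LSTD matrix A_hat and vector b_hat."""
    n = len(transitions)
    A = [[0.0] * dim for _ in range(dim)]
    b = [0.0] * dim
    for s, a, r, s_next in transitions:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))  # next action from the target policy
        for i in range(dim):
            b[i] += f[i] * r / n
            for j in range(dim):
                A[i][j] += f[i] * (f[j] - gamma * f_next[j]) / n
    return A, b

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("A_hat is (numerically) singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

# Hypothetical toy MDP: action 1 switches state, action 0 stays;
# reward 1 only for (s=0, a=1). The target policy always switches.
GAMMA = 0.5

def phi(s, a):
    feat = [0.0] * 4
    feat[2 * s + a] = 1.0  # one-hot feature over (state, action)
    return feat

def policy(s):
    return 1

transitions = [(0, 0, 0.0, 0), (0, 1, 1.0, 1), (1, 0, 0.0, 1), (1, 1, 0.0, 0)]
A_hat, b_hat = lstd_system(transitions, phi, policy, GAMMA, dim=4)
theta = solve_linear(A_hat, b_hat)  # theta[2*s + a] approximates Q^pi(s, a)
```

With one-hot features and every state-action pair covered, the solution coincides with the exact $Q^\pi$ of the toy MDP, for example $Q^\pi(0,1) = 4/3$. With function approximation or limited coverage, $\widehat{A}$ can be ill-conditioned, which is the regime where a quantity such as $\sigma_{\min}(\cdot)$ becomes relevant.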

Figures (11)

  • Figure 1: An illustration of the pipeline in practice that motivates our research (left) and our proposed experimental protocol (right), both for the model-free setting; the pipeline for the model-based setting is analogous and not visualized. Left: Training algorithms are run on training data to generate candidate policies, and choosing among them is the primary model-selection (MS) task (see Section \ref{sec:intro}), which is not the focus of our work. These policies (e.g., $\pi$) become the target policies in OPE, since accurate OPE can help solve the primary MS problem. Different FQE instances (e.g., with different hyperparameters, such as neural architectures) are used to approximate $Q^\pi$, producing $\{Q_i\}$. The selector takes MS data and $\{Q_i\}$ as input and chooses one of them to estimate $J(\pi) \equiv J_{M^\star}(\pi)$. Right: Illustration of our protocol for an experiment unit (Section \ref{sec:protocol}). The target policies $\pi$ can be learned from separate training data but can also be produced in other ways, such as training on inaccurate models $M_i$, which can be realistic for practical scenarios with inaccurate simulators. $\{M_i\}$ is prepared by varying environment parameters. Monte-Carlo rollouts are used to generate the Q-values for the data points in the MS data, which avoids potentially unstable optimization in the OPE pipeline; this is the source of stability and controllability compared to the prior protocol that mimics the practical pipeline. For further discussion of the limitations of our protocol and the trade-offs, see Section \ref{sec:conclusion}.
  • Figure 2: Left: $J_M(\pi)$ in $M\in\mathcal{M}_\mathbf{g}$ (cf. Section \ref{sec:exp-main}) for different target policies. Right: Convergence of Monte-Carlo estimates of $J(\pi)$. Each curve corresponds to a target policy.
  • Figure 3: Main results for comparing model-free selectors in the gravity grid (MF.G; top row) and the noise grid (MF.N; bottom row). Each plot corresponds to a different $M^\star$ as indicated in the plot title. "mb_naive" is model-based but still included since it does not require Bellman operator rollouts.
  • Figure 4: Main results for comparing model-based selectors (cf. Appendix \ref{app:model-based} for details of methods). LSTD-Tournament is included as the best model-free selector for comparison, which surprisingly outperforms the more sophisticated model-based ones in Section \ref{sec:mb-select}.
  • Figure 5: Left: OPE error vs. simulator gaps. Middle: OPE error vs. misspecification. Right: OPE error vs. data coverage.
  • ...and 6 more figures
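
The Figure 1 caption describes generating Q-values for the MS data points by Monte-Carlo rollouts in a candidate model rather than by optimization. A minimal sketch of that idea follows, under loud assumptions: the two-state simulator `step`, the always-switch `policy`, the rollout count, and the truncation horizon are all hypothetical stand-ins (the paper uses Gym-Hopper with varied environment parameters), and `mc_q_value` is an illustrative helper, not the authors' code.

```python
import random

GAMMA = 0.5  # discount factor (assumed for this toy example)

def step(state, action, rng, reward_noise=0.1):
    """Hypothetical simulator M: action 1 switches state, action 0 stays.
    Reward is 1 for (state=0, action=1), plus zero-mean Gaussian noise."""
    next_state = 1 - state if action == 1 else state
    base = 1.0 if (state == 0 and action == 1) else 0.0
    return base + rng.gauss(0.0, reward_noise), next_state

def policy(state):
    """Hypothetical target policy: always take action 1."""
    return 1

def mc_q_value(state, action, n_rollouts=2000, horizon=40, seed=0):
    """Average truncated discounted return over rollouts that start at
    (state, action) and then follow the target policy."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rollouts):
        s, a, ret, discount = state, action, 0.0, 1.0
        for _ in range(horizon):
            r, s = step(s, a, rng)
            ret += discount * r
            discount *= GAMMA
            a = policy(s)  # all later actions come from the target policy
        total += ret
    return total / n_rollouts

# State-action pairs standing in for the MS data points.
ms_data = [(0, 0), (0, 1), (1, 0), (1, 1)]
q_values = {sa: mc_q_value(*sa) for sa in ms_data}
```

Because the estimate is a plain average of simulated returns, its error shrinks with the number of rollouts and involves no learned function approximator, which matches the stability argument in the caption. Truncating at a finite horizon introduces a bias on the order of $\gamma^{H}$, negligible here but worth checking for discount factors close to 1.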

Theorems & Definitions (10)

  • Theorem 1
  • Theorem 2
  • Proposition 1
  • Theorem 3
  • Theorem 4
  • proof
  • proof
  • Lemma 1: Lemma 9 from xie2020batch
  • Lemma 2: Objective estimation error
  • proof: Proof of Lemma \ref{lem:model-obj-concentration}