
Off-policy Evaluation with Deeply-abstracted States

Meiling Hao, Pingfan Su, Liyuan Hu, Zoltan Szabo, Qingyuan Zhao, Chengchun Shi

TL;DR

A novel iterative procedure is proposed that sequentially projects the original state space into a smaller space, resulting in a deeply-abstracted state that substantially reduces the sample complexity of OPE arising from a high-cardinality state space.

Abstract

Off-policy evaluation (OPE) is crucial for assessing a target policy's impact offline before its deployment. However, achieving accurate OPE in large state spaces remains challenging. This paper studies state abstractions, originally designed for policy learning, in the context of OPE. Our contributions are threefold: (i) We define a set of irrelevance conditions central to learning state abstractions for OPE, and derive a backward-model-irrelevance condition for achieving irrelevance in (marginalized) importance sampling ratios by constructing a time-reversed Markov decision process (MDP). (ii) We propose a novel iterative procedure that sequentially projects the original state space into a smaller space, resulting in a deeply-abstracted state that substantially reduces the sample complexity of OPE arising from a high-cardinality state space. (iii) We prove the Fisher consistency of various OPE estimators when applied to our proposed abstract state spaces.
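To make "OPE on an abstract state space" concrete, here is a minimal sketch of the standard per-decision importance sampling (PDIS) estimator in which the importance ratios are evaluated on abstract states $\phi(s)$ rather than on raw states. This is a swapped-in textbook estimator, not the paper's method: it assumes, for simplicity, that both the target policy and the behavior policy depend on the state only through $\phi(s)$, whereas the paper instead derives conditions (such as backward-model-irrelevance) under which ratios on abstract states remain valid. The names `pdis_estimate`, `phi`, `pi`, and `b` are hypothetical.

```python
def pdis_estimate(trajectories, phi, pi, b, gamma=0.99):
    """Per-decision importance sampling (PDIS) estimate of a target policy's value.

    trajectories: list of episodes, each a list of (state, action, reward) tuples.
    phi: hypothetical abstraction map from an original state to an abstract state.
    pi, b: hypothetical target / behavior action probabilities, called as
           pi(a, z) with z = phi(s). ASSUMPTION: both policies depend on the
           state only through phi(s), so ratios can be computed on abstract states.
    """
    returns = []
    for episode in trajectories:
        rho = 1.0    # cumulative importance ratio up to the current step
        value = 0.0  # importance-weighted discounted return of this episode
        for t, (s, a, r) in enumerate(episode):
            z = phi(s)                       # abstract state
            rho *= pi(a, z) / b(a, z)        # per-step ratio on the abstract state
            value += (gamma ** t) * rho * r  # reweight the reward at step t
        returns.append(value)
    # Average the per-episode estimates
    return sum(returns) / len(returns)


# Usage: two raw states share each abstract state (parity); b is uniform over two actions.
phi = lambda s: s % 2
pi = lambda a, z: 0.8 if a == 1 else 0.2
b = lambda a, z: 0.5
estimate = pdis_estimate([[(0, 1, 1.0), (3, 0, 1.0)]], phi, pi, b, gamma=0.5)
```

When `pi` and `b` coincide, every ratio equals one and the estimator reduces to the average discounted return, which is a quick sanity check for any implementation.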

Paper Structure (37 sections, 13 theorems, 51 equations, 4 figures, 1 table)

Key Result

Theorem 1

Suppose Assumptions asmp:bounded through VC hold. Then the following statements hold for any $\pi$-irrelevant Markov state abstraction (MSA) $\phi$:

Figures (4)

  • Figure 1: Illustrations of the iterative procedure.
  • Figure 2: Illustrations of (a) the forward MDP model and (b) the backward MDP model. $b_t$ is a shorthand for $b(A_t|S_t)$ for any $t\ge 1$.
  • Figure 3: RMSEs and root absolute biases of various estimators for 4 different environments.
  • Figure E.1: Two illustrative examples.
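Figure 2 contrasts a forward MDP model with a backward (time-reversed) one. As a rough illustration of what "time-reversed" can mean in a tabular setting, the sketch below applies Bayes' rule to the forward dynamics to obtain the conditional distribution of $(S_t, A_t)$ given $S_{t+1}$. It assumes finite state and action spaces and a known marginal distribution `d_t` of $S_t$ under the behavior policy $b$; the function and argument names are hypothetical, and the paper's actual backward-model construction may differ in detail.

```python
import numpy as np

def backward_kernel(d_t, b, P):
    """Time-reversed conditional q(s_t, a_t | s_{t+1}) obtained by Bayes' rule.

    d_t: (S,) distribution of S_t under the behavior policy (assumed known).
    b:   (S, A) behavior policy, b[s, a] = b(a | s).
    P:   (S, A, S) forward transition kernel, P[s, a, s'] = P(s' | s, a).
    Returns (d_next, q) with d_next[s'] = Pr(S_{t+1} = s') and
    q[s', s, a] = Pr(S_t = s, A_t = a | S_{t+1} = s').
    """
    # Joint law Pr(S_t = s, A_t = a, S_{t+1} = s') from the forward model
    joint = d_t[:, None, None] * b[:, :, None] * P
    d_next = joint.sum(axis=(0, 1))  # marginal of S_{t+1}
    q = np.zeros((P.shape[2],) + b.shape)
    reachable = d_next > 0           # condition only on next states with positive mass
    q[reachable] = np.moveaxis(joint, 2, 0)[reachable] / d_next[reachable][:, None, None]
    return d_next, q


# Usage on a small random tabular MDP
rng = np.random.default_rng(0)
S, A = 3, 2
d_t = rng.random(S); d_t /= d_t.sum()
b = rng.random((S, A)); b /= b.sum(axis=1, keepdims=True)
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
d_next, q = backward_kernel(d_t, b, P)
```

Each slice `q[s']` is a probability distribution over $(s, a)$, and mixing the slices with weights `d_next` recovers the forward joint law $d_t(s)\,b(a \mid s)$, so the backward model carries the same information as the forward one.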

Theorems & Definitions (22)

  • Definition 1: $\pi$-irrelevance
  • Definition 2: Markov State Abstraction
  • Theorem 1: Bias-variance decomposition
  • Theorem 2: Bias, variance and abstraction: The role of $\phi$ in OPE
  • Definition 3: $\pi^*$-irrelevance
  • Definition 4: $Q^{\pi^*}$-irrelevance
  • Definition 5: Model-irrelevance
  • Lemma 1
  • Definition 6: Backward-model-irrelevance
  • Lemma 2
  • ...and 12 more
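Definition 5 (model-irrelevance) is listed above without its statement. For intuition, the sketch below checks the classical tabular form of model-irrelevance: any two states mapped to the same abstract state must have the same expected reward for every action and the same total probability of transitioning into each abstract class. This is the standard notion from the state-abstraction literature, used here as an assumption; the paper's Definition 5 may be phrased differently (for example, for Markov state abstractions or general state spaces). The function name and array layout are hypothetical.

```python
import numpy as np

def is_model_irrelevant(phi, R, P, atol=1e-8):
    """Check the classical tabular model-irrelevance condition for an abstraction.

    phi: (S,) integer array, phi[s] = abstract state of s.
    R:   (S, A) expected rewards, R[s, a].
    P:   (S, A, S) transition kernel, P[s, a, s'] = P(s' | s, a).
    """
    K = int(phi.max()) + 1
    # Aggregate next-state probabilities by abstract class: P_abs[s, a, z]
    P_abs = np.zeros(P.shape[:2] + (K,))
    for z in range(K):
        P_abs[:, :, z] = P[:, :, phi == z].sum(axis=2)
    for z in range(K):
        members = np.flatnonzero(phi == z)
        if members.size == 0:  # unused label, nothing to compare
            continue
        ref = members[0]       # compare every member with one representative
        for s in members[1:]:
            if not (np.allclose(R[s], R[ref], atol=atol)
                    and np.allclose(P_abs[s], P_abs[ref], atol=atol)):
                return False
    return True


# Usage: 4 states, 1 action; states {0, 1} and {2, 3} are merged.
P = np.zeros((4, 1, 4))
P[0, 0] = [0.5, 0.5, 0.0, 0.0]
P[1, 0] = [1.0, 0.0, 0.0, 0.0]
P[2, 0] = [0.0, 0.0, 0.3, 0.7]
P[3, 0] = [0.0, 0.0, 0.7, 0.3]
R = np.array([[1.0], [1.0], [0.0], [0.0]])
phi = np.array([0, 0, 1, 1])
```

States 0 and 1 have different raw transition rows but send all of their mass into the same abstract class, so merging them is still model-irrelevant; this is exactly the kind of compression that lets an abstraction shrink the state space without changing the abstract dynamics.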