Efficient Policy Evaluation with Offline Data Informed Behavior Policy Design

Shuze Liu, Shangtong Zhang

TL;DR

This paper proposes a tailored closed-form behavior policy that provably reduces the variance of an online Monte Carlo estimator, and designs efficient algorithms to learn this closed-form behavior policy from previously collected offline data.

Abstract

Most reinforcement learning practitioners evaluate their policies with online Monte Carlo estimators for either hyperparameter tuning or testing different algorithmic design choices, where the policy is repeatedly executed in the environment to get the average outcome. Such massive interactions with the environment are prohibitive in many scenarios. In this paper, we propose novel methods that improve the data efficiency of online Monte Carlo estimators while maintaining their unbiasedness. We first propose a tailored closed-form behavior policy that provably reduces the variance of an online Monte Carlo estimator. We then design efficient algorithms to learn this closed-form behavior policy from previously collected offline data. Theoretical analysis is provided to characterize how the behavior policy learning error affects the amount of reduced variance. Compared with previous works, our method achieves better empirical performance in a broader set of environments, with fewer requirements for offline data.

Paper Structure (20 sections, 12 theorems, 91 equations, 4 figures, 3 tables, 1 algorithm)

Key Result

Lemma 1

$\forall \mu \in \Lambda, \mathbb{E}_{A\sim\mu}\left[\rho(A)q(A)\right] = \mathbb{E}_{A\sim\pi}\left[q(A)\right].$
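Lemma 1 can be checked numerically on a small example. The sketch below assumes a finite action set, a deterministic outcome $q(a)$ per action, and $\rho(A) = \pi(A)/\mu(A)$, with $\mu$ placing mass wherever $\pi$ does; these assumptions, the function names `is_mean` and `is_variance`, and the numbers are illustrative and not taken from the paper. The "tilted" policy $\mu(a) \propto \pi(a)|q(a)|$ is the classic variance-minimizing importance sampling proposal for this simplified setting, shown only to illustrate that a well-chosen behavior policy can keep the mean while lowering the variance; it is not claimed to be the paper's closed-form policy.

```python
import numpy as np


def _as_arrays(pi, mu, q):
    """Convert inputs to float arrays and check that mu covers pi's support."""
    pi, mu, q = (np.asarray(x, dtype=float) for x in (pi, mu, q))
    if np.any((pi > 0) & (mu <= 0)):
        raise ValueError("mu must put positive mass wherever pi does")
    return pi, mu, q


def is_mean(pi, mu, q):
    """Exact E_{A~mu}[rho(A) q(A)] with rho = pi / mu (assumed definition)."""
    pi, mu, q = _as_arrays(pi, mu, q)
    s = mu > 0  # actions that mu can actually sample
    rho = pi[s] / mu[s]
    return float(np.sum(mu[s] * rho * q[s]))


def is_variance(pi, mu, q):
    """Exact variance of the single-sample estimator rho(A) q(A), A ~ mu."""
    pi, mu, q = _as_arrays(pi, mu, q)
    s = mu > 0
    rho = pi[s] / mu[s]
    second_moment = float(np.sum(mu[s] * (rho * q[s]) ** 2))
    return second_moment - is_mean(pi, mu, q) ** 2


# Hypothetical three-action example.
pi = np.array([0.5, 0.3, 0.2])   # target policy
q = np.array([1.0, 4.0, -2.0])   # deterministic outcome per action
uniform = np.full(3, 1.0 / 3.0)  # an arbitrary covering behavior policy
tilted = pi * np.abs(q) / np.sum(pi * np.abs(q))  # mu proportional to pi * |q|

on_policy_value = float(np.sum(pi * q))
print(on_policy_value, is_mean(pi, uniform, q), is_mean(pi, tilted, q))
print(is_variance(pi, pi, q), is_variance(pi, tilted, q))
```

All three means agree, as the lemma states, while the per-sample variance drops from 4.41 on-policy to 2.72 under the tilted policy in this toy case.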

Figures (4)

  • Figure 1: Results on Gridworld. The curves are averaged over 900 trials (30 target policies, each having 30 independent runs). The shaded regions denote standard errors and are invisible for some curves because they are too small.
  • Figure 2: Results on MuJoCo environments. Each curve is averaged over 900 trials (30 target policies, each having 30 independent runs). The shaded regions denote standard errors and are invisible for some curves because they are too small.
  • Figure 3: MuJoCo (Todorov et al., 2012) robot simulation tasks. MuJoCo is a physics engine for robotics simulation and contains various stochastic environments. The goal in each environment is to control a robot to achieve different behaviors such as walking, jumping, and balancing. Environments from left to right are Ant, Hopper, InvertedDoublePendulum, InvertedPendulum, and Walker. We conducted experiments on these five environments, with results reported in the experiments section.
  • Figure 4: MuJoCo results using steps as the $x$-axis. We draw each curve from step $100$ because some policies need more than $100$ steps to finish the first episode. All curves are averaged over 900 trials (30 target policies, each having 30 independent runs). The shaded regions denote standard errors and are invisible because they are too small.

Theorems & Definitions (22)

  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Theorem 1: Unbiasedness
  • Theorem 2: Optimal Behavior Policy
  • Theorem 3: Variance Reduction
  • Theorem 4
  • Theorem 5
  • proof
  • proof
  • ...and 12 more