Efficient Policy Evaluation with Offline Data Informed Behavior Policy Design

Shuze Liu; Shangtong Zhang

TL;DR

This paper proposes a tailored closed-form behavior policy that provably reduces the variance of an online Monte Carlo estimator, and designs efficient algorithms to learn this closed-form behavior policy from previously collected offline data.

Abstract

Most reinforcement learning practitioners evaluate their policies with online Monte Carlo estimators for either hyperparameter tuning or testing different algorithmic design choices, where the policy is repeatedly executed in the environment to get the average outcome. Such massive interactions with the environment are prohibitive in many scenarios. In this paper, we propose novel methods that improve the data efficiency of online Monte Carlo estimators while maintaining their unbiasedness. We first propose a tailored closed-form behavior policy that provably reduces the variance of an online Monte Carlo estimator. We then design efficient algorithms to learn this closed-form behavior policy from previously collected offline data. Theoretical analysis is provided to characterize how the behavior policy learning error affects the amount of reduced variance. Compared with previous works, our method achieves better empirical performance in a broader set of environments, with fewer requirements for offline data.

Paper Structure

This paper contains 20 sections, 12 theorems, 91 equations, 4 figures, 3 tables, 1 algorithm.

Introduction
Background
Variance Reduction in Statistics
Variance Reduction in Reinforcement Learning
Learning Closed-Form Behavior Policies
Related Work
Empirical Results
Conclusion
Proofs
Proof of Lemma 1
Proof of Lemma 2
Proof of Lemma 3
Proof of Theorem 1
Proof of Theorem 2
Proof of Theorem 3
...and 5 more sections

Key Result

Lemma 1

$\forall \mu \in \Lambda, \mathbb{E}_{A\sim\mu}\left[\rho(A)q(A)\right] = \mathbb{E}_{A\sim\pi}\left[q(A)\right].$
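
Lemma 1 can be checked numerically on a small discrete action set. The sketch below is illustrative only: it assumes $\rho(A) = \pi(A)/\mu(A)$ is the usual importance sampling ratio and that $\mu$ puts positive probability on every action (the page does not reproduce the paper's definitions of $\rho$ or $\Lambda$). The four-action distributions, the outcome values `q`, and the helper `expectation` are hypothetical and not taken from the paper.

```python
import numpy as np


def expectation(dist, values):
    """Exact expectation of `values` under the discrete distribution `dist`."""
    return float(np.dot(dist, values))


# Hypothetical 4-action example; the numbers are illustrative, not from the paper.
pi = np.array([0.1, 0.2, 0.3, 0.4])  # target distribution
mu = np.array([0.4, 0.3, 0.2, 0.1])  # behavior distribution, positive on every action
q = np.array([1.0, -2.0, 0.5, 3.0])  # outcome associated with each action

# Assumption: rho is the standard importance ratio pi(a) / mu(a).
rho = pi / mu

lhs = expectation(mu, rho * q)  # E_{A ~ mu}[rho(A) q(A)], computed exactly
rhs = expectation(pi, q)        # E_{A ~ pi}[q(A)], computed exactly

# Sample-based version: draw actions from mu and average the reweighted outcomes.
rng = np.random.default_rng(0)
actions = rng.choice(len(mu), size=200_000, p=mu)
is_estimate = float(np.mean(rho[actions] * q[actions]))

print(f"exact under mu: {lhs:.4f}, exact under pi: {rhs:.4f}, sampled: {is_estimate:.4f}")
```

The two exact expectations agree because each term $\mu(a)\,\rho(a)\,q(a)$ reduces to $\pi(a)\,q(a)$. The sampled estimate differs from them only by Monte Carlo noise, and the size of that noise depends on the choice of $\mu$.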

Figures (4)

Figure 1: Results on Gridworld. The curves are averaged over 900 trials (30 target policies, each having 30 independent runs). The shaded regions denote standard errors and are invisible for some curves because they are too small.
Figure 2: Results on MuJoCo environments. Each curve is averaged over 900 trials (30 target policies, each having 30 independent runs). The shaded regions denote standard errors and are invisible for some curves because they are too small.
Figure 3: MuJoCo (Todorov et al., 2012) robot simulation tasks. MuJoCo is a physics engine for robotics simulation and contains various stochastic environments. The goal in each environment is to control a robot to achieve different behaviors such as walking, jumping, and balancing. Environments from left to right are Ant, Hopper, InvertedDoublePendulum, InvertedPendulum, and Walker. We conducted experiments on these five environments, with results reported in the Empirical Results section.
Figure 4: MuJoCo results using steps as the $x$-axis. We draw each curve from step $100$ because some policies need more than $100$ steps to finish the first episode. All curves are averaged over 900 trials (30 target policies, each having 30 independent runs). The shaded regions denote standard errors and are invisible because they are too small.

Theorems & Definitions (22)

Lemma 1
Lemma 2
Lemma 3
Theorem 1: Unbiasedness
Theorem 2: Optimal Behavior Policy
Theorem 3: Variance Reduction
Theorem 4
Theorem 5
proof
proof
...and 12 more
