Efficient Multi-Policy Evaluation for Reinforcement Learning

Shuze Daniel Liu, Claire Chen, Shangtong Zhang

TL;DR

This paper tackles the inefficiency of evaluating many target policies in reinforcement learning by introducing a tailored behavior policy that enables data sharing across policies. The authors formalize a variance-minimization framework and derive an optimal sampling distribution that, when extended to multi-step RL, yields unbiased off-policy estimates with lower variance than on-policy methods under policy similarity conditions. They prove key theoretical results showing that the proposed approach does not require restrictive assumptions and can achieve sample-efficiency gains, with variance reductions that do not scale with the number of policies. Empirically, the method (MPE) achieves state-of-the-art variance reduction on Gridworld and MuJoCo, using offline data to learn the necessary quantities via Fitted Q-Evaluation (FQE) and outperforming several baselines by large margins. Overall, the work delivers a principled, scalable solution for efficient multi-policy evaluation with broad practical impact for policy selection and evaluation in RL.

Abstract

To unbiasedly evaluate multiple target policies, the dominant approach among RL practitioners is to run and evaluate each target policy separately. However, this evaluation method is far from efficient because samples are not shared across policies, and running target policies to evaluate themselves is actually not optimal. In this paper, we address these two weaknesses by designing a tailored behavior policy to reduce the variance of estimators across all target policies. Theoretically, we prove that executing this behavior policy with manyfold fewer samples outperforms on-policy evaluation on every target policy under characterized conditions. Empirically, we show our estimator has a substantially lower variance compared with previous best methods and achieves state-of-the-art performance in a broad range of environments.
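
To make the idea of a shared, tailored behavior policy concrete, here is a minimal sketch in a one-step (bandit) setting with deterministic rewards. It is an illustration under stated assumptions, not the paper's algorithm: the rewards, the two target policies, and the helper names `is_mean_and_variance` and `shared_behavior` are all hypothetical, and the closed form used for the behavior policy (probability proportional to |r(a)| times the root of the summed squared target probabilities) is the minimizer of the summed importance-sampling variance only in this simplified setting. The sketch computes means and variances exactly rather than by simulation.

```python
import math


def is_mean_and_variance(target, behavior, rewards):
    """Exact mean and variance of the one-sample importance sampling estimate
    (target[a] / behavior[a]) * rewards[a], where action a is drawn from `behavior`.
    Assumes behavior[a] > 0 wherever target[a] * rewards[a] != 0 (coverage)."""
    mean = 0.0
    second_moment = 0.0
    for t, b, r in zip(target, behavior, rewards):
        if b > 0:
            estimate = (t / b) * r
            mean += b * estimate
            second_moment += b * estimate ** 2
    return mean, second_moment - mean ** 2


def shared_behavior(targets, rewards):
    """Assumed bandit-case design: mu(a) proportional to
    |r(a)| * sqrt(sum_k pi_k(a)^2), which minimizes the sum over target
    policies of the importance sampling variance in this one-step setting."""
    weights = [
        abs(r) * math.sqrt(sum(t[a] ** 2 for t in targets))
        for a, r in enumerate(rewards)
    ]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical problem: three actions, two similar target policies.
rewards = [1.0, 2.0, 4.0]
targets = [
    [0.5, 0.3, 0.2],
    [0.4, 0.3, 0.3],
]

mu = shared_behavior(targets, rewards)

for k, target in enumerate(targets):
    # On-policy evaluation: the target policy samples for itself.
    true_value, on_policy_var = is_mean_and_variance(target, target, rewards)
    # Off-policy evaluation: one shared behavior policy serves every target.
    estimate, shared_var = is_mean_and_variance(target, mu, rewards)
    print(f"policy {k}: value={true_value:.3f} "
          f"IS mean={estimate:.3f} "
          f"on-policy var={on_policy_var:.3f} shared var={shared_var:.3f}")
```

In this toy instance the importance sampling mean matches each target's true value, and a single sample from the shared policy has lower variance for both targets than an on-policy sample. That outcome depends on the two targets being similar; it is not guaranteed for arbitrary policies.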

Paper Structure (21 sections, 14 theorems, 96 equations, 5 figures, 4 tables, 1 algorithm)

Key Result

Lemma 1

$\forall \mu \in \Lambda, \forall k$,

Figures (5)

  • Figure 1: Results on Gridworld. Each curve is averaged over 900 runs (30 groups of policies, each having 30 independent runs). Shaded regions denote standard errors and are invisible for some curves because they are too small.
  • Figure 2: Results on MuJoCo. Each curve is averaged over 900 runs (30 groups of target policies, each having 30 independent runs). Shaded regions denote standard errors and are invisible for some curves because they are too small.
  • Figure 3: Results on Gridworld. Each curve is averaged over 900 runs (the corresponding target policies from 30 groups, each having 30 independent runs). Shaded regions denote standard errors and are invisible for some curves because they are too small.
  • Figure 4: Results on Gridworld. Each curve is averaged over 900 runs (the corresponding target policies from 30 groups, each having 30 independent runs). Shaded regions denote standard errors and are invisible for some curves because they are too small.
  • Figure 5: MuJoCo robot simulation tasks [todorov2012mujoco]. Pictures are adapted from [liu2024efficient]. Environments from left to right are Ant, Hopper, InvertedDoublePendulum, InvertedPendulum, and Walker.

Theorems & Definitions (25)

  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Theorem 1: Unbiasedness
  • Theorem 2: Behavior Policy Design
  • Theorem 3: Variance Reduction with Same Sample Sizes
  • Theorem 4: Variance Reduction
  • proof
  • proof
  • ...and 15 more