Efficient Multi-Policy Evaluation for Reinforcement Learning
Shuze Daniel Liu, Claire Chen, Shangtong Zhang
TL;DR
This paper addresses the inefficiency of evaluating many target policies in reinforcement learning by designing a tailored behavior policy whose samples are shared across all target policies. The authors formalize evaluation as a variance-minimization problem and derive an optimal sampling distribution. Extended to multi-step RL, this distribution yields unbiased off-policy estimates with lower variance than on-policy evaluation when the target policies satisfy a similarity condition. The theoretical results do not rely on restrictive assumptions and show sample-efficiency gains, with variance reductions that do not scale with the number of target policies. Empirically, the method (MPE) achieves state-of-the-art variance reduction on Gridworld and MuJoCo, learning the quantities it needs from offline data via FQE and outperforming several baselines by large margins. Overall, the work offers a principled and scalable approach to multi-policy evaluation, with direct relevance to policy selection and evaluation in RL.
Abstract
To unbiasedly evaluate multiple target policies, the dominant approach among RL practitioners is to run and evaluate each target policy separately. However, this evaluation method is far from efficient because samples are not shared across policies, and running target policies to evaluate themselves is actually not optimal. In this paper, we address these two weaknesses by designing a tailored behavior policy to reduce the variance of estimators across all target policies. Theoretically, we prove that executing this behavior policy with manyfold fewer samples outperforms on-policy evaluation on every target policy under characterized conditions. Empirically, we show our estimator has a substantially lower variance compared with previous best methods and achieves state-of-the-art performance in a broad range of environments.
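The claim that one shared behavior policy can beat on-policy evaluation for every target policy can be illustrated in a toy one-step (bandit) setting. The sketch below is not the paper's multi-step construction: it assumes known, deterministic rewards, and the names `shared_behavior_policy`, `on_policy_variance`, and `off_policy_variance` are hypothetical. The behavior distribution used here, mu(a) proportional to sqrt(sum_k pi_k(a)^2 r(a)^2), is the closed-form minimizer of the summed importance-sampling variances in this simplified setting, which is an assumption made for illustration rather than a formula taken from the source.

```python
import numpy as np


def on_policy_variance(pi, r):
    """Per-sample variance of the Monte Carlo estimate of E_pi[r] when a ~ pi."""
    mean = pi @ r
    return pi @ r**2 - mean**2


def off_policy_variance(pi, mu, r):
    """Per-sample variance of the importance-sampling estimate
    (pi(a) / mu(a)) * r(a) when a ~ mu.

    The estimate is unbiased as long as mu(a) > 0 wherever pi(a) * r(a) != 0.
    """
    mean = pi @ r
    second_moment = np.sum(pi**2 * r**2 / mu)
    return second_moment - mean**2


def shared_behavior_policy(policies, r):
    """Toy-setting assumption: mu(a) proportional to sqrt(sum_k pi_k(a)^2 r(a)^2).

    This minimizes the sum over target policies of the importance-sampling
    variance in a one-step problem with known deterministic rewards.
    """
    weights = np.sqrt(np.sum(policies**2 * r**2, axis=0))
    return weights / weights.sum()


# Illustrative numbers only: three actions, two similar target policies.
rewards = np.array([1.0, 2.0, 4.0])
policies = np.array([
    [0.5, 0.3, 0.2],
    [0.4, 0.3, 0.3],
])

mu = shared_behavior_policy(policies, rewards)

# Expected value of the importance-sampling estimate under mu, per policy.
is_means = np.array([np.sum(mu * (pi / mu) * rewards) for pi in policies])
true_means = policies @ rewards

on_var = np.array([on_policy_variance(pi, rewards) for pi in policies])
off_var = np.array([off_policy_variance(pi, mu, rewards) for pi in policies])

print("behavior policy:", mu)
print("on-policy variances: ", on_var)
print("off-policy variances:", off_var)
```

In this example every sample drawn from `mu` is reused for both target policies, and the per-sample variance is lower than on-policy sampling for each of them. With less similar target policies the shared distribution need not dominate on-policy evaluation, which is consistent with the abstract's qualifier "under characterized conditions".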
