A Look at Value-Based Decision-Time vs. Background Planning Methods Across Different Settings
Safa Alver, Doina Precup
TL;DR
This paper examines when value-based decision-time planning and value-based background planning outperform each other in model-based RL, using the discounted return $J_{m}^{\pi}$ as the key metric. It analyzes both the simplest tabular instantiations (OMCP vs. Dyna-Q) and modern neural instantiations (deep MPC vs. deep Dyna-Q), supported by theory and illustrative experiments in the regular RL and transfer learning settings. The main finding is that the simplest instantiations perform on par, whereas modern decision-time planning can match or exceed modern background planning in both settings, because it avoids harmful simulated updates and can improve its policy online. The results inform the choice between planning methods and suggest practical ways to improve background planning.
Abstract
In model-based reinforcement learning (RL), an agent can leverage a learned model to improve its behavior in several ways. Two prevalent ways of doing so are decision-time planning and background planning. In this study, we are interested in understanding how the value-based versions of these two planning methods compare against each other across different settings. Towards this goal, we first consider the simplest instantiations of value-based decision-time and background planning methods and provide theoretical results on which one will perform better in the regular RL and transfer learning settings. Then, we consider their modern instantiations and provide hypotheses on which one will perform better in the same settings. Finally, we perform illustrative experiments to validate these theoretical results and hypotheses. Overall, our findings suggest that even though the value-based versions of the two planning methods perform on par in their simplest instantiations, the modern instantiations of value-based decision-time planning methods can perform on par with or better than the modern instantiations of value-based background planning methods in both the regular RL and transfer learning settings.
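To make the distinction between the two planning styles concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a hypothetical three-state chain with a hand-written deterministic model; the names `MODEL`, `background_planning`, and `decision_time_planning` and all numeric values are illustrative assumptions. Background planning is shown as Dyna-Q-style updates from simulated transitions, and decision-time planning is shown as an exhaustive depth-limited lookahead, used here as a simple stand-in for the planners named in the summary.

```python
import random

# Hypothetical learned model of a 3-state chain (an assumption for illustration):
# (state, action) -> (reward, next_state). State 2 is terminal.
MODEL = {
    (0, "left"): (0.0, 0),
    (0, "right"): (0.0, 1),
    (1, "left"): (0.0, 0),
    (1, "right"): (1.0, 2),
}
ACTIONS = ("left", "right")
TERMINAL = 2
GAMMA = 0.9  # discount factor (assumed value)


def background_planning(q, n_updates, alpha=0.5, seed=0):
    """Dyna-Q-style planning: apply Q-learning updates to transitions
    simulated from the model, independently of action selection."""
    rng = random.Random(seed)
    pairs = sorted(MODEL)
    for _ in range(n_updates):
        s, a = rng.choice(pairs)
        r, s_next = MODEL[(s, a)]
        if s_next == TERMINAL:
            bootstrap = 0.0
        else:
            bootstrap = max(q[(s_next, b)] for b in ACTIONS)
        q[(s, a)] += alpha * (r + GAMMA * bootstrap - q[(s, a)])
    return q


def decision_time_planning(state, depth):
    """Plan only for the current state: depth-limited lookahead in the
    model, returning the best first action and its estimated return."""
    def value(s, d):
        if s == TERMINAL or d == 0:
            return 0.0
        return max(
            MODEL[(s, a)][0] + GAMMA * value(MODEL[(s, a)][1], d - 1)
            for a in ACTIONS
        )

    scores = {
        a: MODEL[(state, a)][0] + GAMMA * value(MODEL[(state, a)][1], depth - 1)
        for a in ACTIONS
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


if __name__ == "__main__":
    q = background_planning({k: 0.0 for k in MODEL}, n_updates=2000)
    print("background greedy action in state 0:",
          max(ACTIONS, key=lambda a: q[(0, a)]))
    print("decision-time plan in state 0:", decision_time_planning(0, depth=3))
```

In this toy setting both routes select the same action in state 0, which is in the spirit of the "on par" result for the simplest instantiations. The structural difference is that background planning stores the planning results in `q` for later reuse, while decision-time planning recomputes them for the state at hand each time an action is needed.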
