A Look at Value-Based Decision-Time vs. Background Planning Methods Across Different Settings

Safa Alver, Doina Precup

TL;DR

This paper examines when value-based decision-time planning and background planning outperform each other in model-based RL, using the discounted return $J_{m}^{\pi}$ as the key metric. It analyzes both the simplest tabular instantiations (OMCP vs. Dyna-Q) and modern neural instantiations (deep MPC vs. deep Dyna-Q), supported by theory and illustrative experiments in the regular RL and transfer learning settings. The main finding is that the simple instantiations perform on par, while modern decision-time planning can match or exceed modern background planning in both settings, because it avoids harmful simulated updates and can improve its policy online. The work informs the choice between planning methods and suggests practical improvements to background planning.

Abstract

In model-based reinforcement learning (RL), an agent can leverage a learned model to improve its way of behaving in different ways. Two of the prevalent ways to do this are through decision-time and background planning methods. In this study, we are interested in understanding how the value-based versions of these two planning methods will compare against each other across different settings. Towards this goal, we first consider the simplest instantiations of value-based decision-time and background planning methods and provide theoretical results on which one will perform better in the regular RL and transfer learning settings. Then, we consider the modern instantiations of them and provide hypotheses on which one will perform better in the same settings. Finally, we perform illustrative experiments to validate these theoretical results and hypotheses. Overall, our findings suggest that even though value-based versions of the two planning methods perform on par in their simplest instantiations, the modern instantiations of value-based decision-time planning methods can perform on par or better than the modern instantiations of value-based background planning methods in both the regular RL and transfer learning settings.
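To make the distinction between the two planning styles concrete, the following is a minimal Python sketch, not the paper's pseudocode. It assumes a hypothetical 5-state chain environment, a deterministic tabular model, and a uniformly random rollout policy; all names (`env_step`, `dyna_q`, `decision_time_action`, and the hyperparameters) are illustrative. Background planning (tabular Dyna-Q) uses simulated transitions to update a value table between real steps, whereas the rollout-based decision-time planner uses the same learned model only to choose the current action.

```python
import random

# Hypothetical 5-state chain: states 0..4, actions 0 (left) and 1 (right).
# Entering state 4 yields reward 1 and ends the episode. All values are assumptions.
N_STATES, ACTIONS, GOAL, GAMMA = 5, (0, 1), 4, 0.9

def env_step(s, a):
    s2 = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    return s2, float(s2 == GOAL), s2 == GOAL

def greedy(q, s, rng):
    best = max(q[s])
    return rng.choice([a for a in ACTIONS if q[s][a] == best])

def q_update(q, s, a, s2, r, done, alpha):
    # One-step Q-learning update toward the bootstrapped target.
    target = r + (0.0 if done else GAMMA * max(q[s2]))
    q[s][a] += alpha * (target - q[s][a])

def dyna_q(episodes=30, planning_steps=10, alpha=0.5, eps=0.1, seed=0):
    """Background planning: simulated transitions update Q between real steps."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    model = {}  # (s, a) -> (s2, r, done), filled from real experience
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.choice(ACTIONS) if rng.random() < eps else greedy(q, s, rng)
            s2, r, done = env_step(s, a)
            model[(s, a)] = (s2, r, done)
            q_update(q, s, a, s2, r, done, alpha)        # real experience
            for _ in range(planning_steps):              # simulated experience
                ps, pa = rng.choice(list(model))
                ps2, pr, pdone = model[(ps, pa)]
                q_update(q, ps, pa, ps2, pr, pdone, alpha)
            s = s2
    return q, model

def rollout_return(model, s, a, horizon, rng):
    """Discounted return of one simulated rollout that starts with action a."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        if (s, a) not in model:   # unvisited pair: the model cannot predict it
            break
        s, r, done = model[(s, a)]
        total += discount * r
        discount *= GAMMA
        if done:
            break
        a = rng.choice(ACTIONS)   # uniformly random rollout policy (assumption)
    return total

def decision_time_action(model, s, rng, n_rollouts=50, horizon=10):
    """Decision-time planning: pick the action with the best mean rollout return."""
    def score(a):
        returns = [rollout_return(model, s, a, horizon, rng) for _ in range(n_rollouts)]
        return sum(returns) / n_rollouts
    return max(ACTIONS, key=score)

if __name__ == "__main__":
    q, model = dyna_q()
    rng = random.Random(1)
    print("Dyna-Q greedy actions:  ", [greedy(q, s, rng) for s in range(GOAL)])
    print("Decision-time actions:  ", [decision_time_action(model, s, rng) for s in range(GOAL)])
```

In this sketch, an inaccurate model would corrupt the Dyna-Q value table through the simulated updates, while the decision-time planner would only be affected at the moment of action selection. This is one way to read the TL;DR's remark about harmful simulated updates, not a reproduction of the paper's experiments.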

Paper Structure

This paper contains 14 sections, 2 theorems, 4 equations, 5 figures, 3 tables, 2 algorithms.

Key Result

Proposition 1

Let $m_{O}\in\mathcal{M}$ and $m_{D}\in\mathcal{M}$ denote the models of the OMCP and Dyna-Q algorithms, respectively, and let $\pi_{m_O}$ and $\pi_{m_D}$ denote their final policies generated as a result of planning with their corresponding models. Then, when $m_O$ and $m_D$ become VE models

Figures (5)

  • Figure 1: The different planning methods within decision-time planning in which planning is done (a) by purely performing rollouts, and (b) by first performing some amount of search and then by either performing rollouts or bootstrapping on value estimates. The subscripts and superscripts on the states indicate the time steps and state identifiers, respectively. The black triangles indicate the terminal states.
  • Figure 2: (a) The SG environment. (b-e) The OneRoom, FourRooms, SCS9N1 and LCS9N1 environments in MiniGrid. (f-h) The source task of difficulty $0.35$ and target tasks of difficulties $0.35$ and $0.45$ in the RDS environment. The difficulty parameter controls the density of the lava cells between the agent and the goal cell, and the target tasks are transposed versions of the source task. With every reset of the episode, a new lava cell pattern is procedurally generated for both the source and target tasks.
  • Figure 3: (a, b) The performance of OMCP and Dyna-Q in the (a) regular RL and (b) transfer learning settings. (c-k) The performance of deep MPC, deep Dyna-Q (abbreviated as MPC and Dyna-Q) and DQN in the (c-f) regular RL and (g-k) transfer learning settings. The black dashed lines indicate the performance of the optimal policy in the corresponding environment. The green and magenta dotted lines indicate the point after which the models of the decision-time and background planning algorithms become and remain as VE models, respectively. The shaded regions are confidence intervals over (a, b) 50 and (c-k) 25 runs.
  • Figure : The joint pseudocode of the OMCP algorithm (with an adaptable model and improving rollout policy) and the Dyna-Q algorithm. The blue and red colored parts are specific to the OMCP and Dyna-Q algorithms, respectively. That is, OMCP does not contain the red parts and Dyna-Q does not contain the blue parts.
  • Figure : The joint pseudocode of the deep MPC and deep Dyna-Q algorithms. The blue and red colored parts are specific to the deep MPC and deep Dyna-Q algorithms, respectively. That is, deep MPC does not contain the red parts and deep Dyna-Q does not contain the blue parts. Here, the imaginary replay buffer stores the state-action pairs that are to be used in generating simulated experience.

Theorems & Definitions (3)

  • Proposition 1
  • Proposition 1
  • proof