VariBASeD: Variational Bayes-Adaptive Sequential Monte-Carlo Planning for Deep Reinforcement Learning

Joery A. de Vries, Jinke He, Yaniv Oren, Pascal R. van der Vaart, Mathijs M. de Weerdt, Matthijs T. J. Spaan

TL;DR

This paper proposes a variational framework for learning and planning in Bayes-adaptive Markov decision processes that coalesces variational belief learning, sequential Monte-Carlo planning, and meta-reinforcement learning.

Abstract

Optimally trading off exploration and exploitation is the holy grail of reinforcement learning, as it promises maximal data efficiency for solving any task. Bayes-optimal agents achieve this, but obtaining the belief state and planning over it are both typically intractable. Although deep learning methods can greatly help in scaling this computation, existing methods are still costly to train. To reduce this training cost, this paper proposes a variational framework for learning and planning in Bayes-adaptive Markov decision processes that coalesces variational belief learning, sequential Monte-Carlo planning, and meta-reinforcement learning. In a single-GPU setup, our new method VariBASeD exhibits favorable scaling to larger planning budgets, improving sample and runtime efficiency over prior methods.

Paper Structure

This paper contains 30 sections, 4 theorems, 23 equations, 6 figures, 2 tables, 2 algorithms.

Key Result

Theorem 2.1

The policy $p_{\pi}^+(A_t | S_t, b_t, \mathcal{O}_{t:T}) \propto \Phi_{t:T} \cdot \pi^+(A_t | S_t, b_t)$ satisfies [...], which is a regularized policy objective. Here $q^*$ also guarantees a policy improvement in the unregularized MDP: $\mathbb{E}_{p_{q^*}} [\sum_{t=1}^{T} R_t] \ge \mathbb{E}_{p_{\pi^+}} [\sum_{t=1}^{T} R_t].$
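
As a rough illustration of the kind of statement Theorem 2.1 makes, the sketch below works through a one-step, discrete-action analogue. It is not the paper's method: the potential $\Phi$ is assumed here to take the form $\exp(Q/\tau)$, and the names `tilted_policy`, `regularized_objective`, `prior`, `q_values`, and `temperature` are hypothetical. Under that assumption, reweighting a prior policy by the potential gives the maximizer of the KL-regularized objective $\mathbb{E}_q[Q] - \tau \, \mathrm{KL}(q \,\|\, \pi)$, and its expected return is never below that of the prior.

```python
import math


def tilted_policy(prior, q_values, temperature=1.0):
    """Return q(a) proportional to prior(a) * exp(q_values[a] / temperature).

    Assumption for illustration: the potential is exp(Q / temperature).
    """
    shift = max(q_values)  # subtract the max for numerical stability
    weights = [
        p * math.exp((q - shift) / temperature)
        for p, q in zip(prior, q_values)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def expected_value(policy, q_values):
    """Expected one-step return of a policy."""
    return sum(p * q for p, q in zip(policy, q_values))


def kl_divergence(q, p):
    """KL(q || p) for discrete distributions; terms with q(a) = 0 contribute 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)


def regularized_objective(policy, prior, q_values, temperature=1.0):
    """Expected return minus a KL penalty towards the prior policy."""
    return expected_value(policy, q_values) - temperature * kl_divergence(policy, prior)


# Hypothetical three-action example.
prior = [0.5, 0.3, 0.2]
q_values = [0.0, 1.0, 2.0]
q_star = tilted_policy(prior, q_values)

# Two alternative policies to compare against: the prior itself and a greedy policy.
greedy = [0.0, 0.0, 1.0]
objective_star = regularized_objective(q_star, prior, q_values)
objective_prior = regularized_objective(prior, prior, q_values)
objective_greedy = regularized_objective(greedy, prior, q_values)
```

In this toy setting, `objective_star` is at least as large as both alternatives, and `expected_value(q_star, q_values)` is at least `expected_value(prior, q_values)`, mirroring the improvement guarantee in the unregularized problem. The theorem itself concerns multi-step trajectories over states and beliefs, which this sketch does not model.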

Figures (6)

  • Figure 1: Generative model (left), and the unstructured (middle) vs. structured (right) inference models that we consider. Stochastic variables are placed in circles, deterministic variables in rectangles. The double circle for $\langle S_t, b_t \rangle$ indicates a "joint" variable; note, however, that $S_t$ is stochastic whereas the belief evolves deterministically. Colored nodes are observed, blank nodes are latent.
  • Figure 2: Evaluation rollouts on the function optimization and gridworld problems over multiple test episodes. Due to the stochastic nature of our method, we visualize one sample rollout for the function optimization problem (with average metrics), and the averaged rollouts for the gridworld. For the gridworld, the circle indicates the starting tile and the cross the goal tile. On the function optimization problem we used a planning budget of $H=1, K=16$, and on the gridworld $H=4, K=32$.
  • Figure 3: Learning curves for all environments comparing our VariBASeD against recurrent PPO (RL$^2$; Duan et al., 2016) using the S5 architecture (Lu et al., 2023) for different planning budgets. Shaded regions give 99% two-sided BCa-bootstrap intervals over 30 seeds. In our framework we found that the PPO baseline did not learn.
  • Figure 4: Visualization of environments. Left: continuous function optimization task, where each color corresponds to a distinct function that the agent has to learn to optimize from sequential interactions. Right: discrete opengrid task where the agent needs to find the goal tile $\times$ over multiple episodes starting from the dot.
  • Figure : Inner-Loop; BA-SMC planner
  • ...and 1 more figure

Theorems & Definitions (4)

  • Theorem 2.1: Proof in the Appendix (label vb:proof:proxy_cai)
  • Proposition 2.2: Proof in the Appendix (label vb:proof:em)
  • Corollary 3.1: Proof in the Appendix (label vb:proof:is_weights)
  • Proposition 3.2: Proof in the Appendix (label vb:proof:kl_to_elbo_swap)