VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning
Luisa Zintgraf, Kyriacos Shiarlis, Maximilian Igl, Sebastian Schulze, Yarin Gal, Katja Hofmann, Shimon Whiteson
TL;DR
This paper tackles exploration under task uncertainty in deep reinforcement learning by introducing variBAD, a meta-learned variational framework that infers a low-dimensional embedding of the unknown task and conditions the policy on the posterior over this embedding. By jointly learning a generative model of environment dynamics and rewards with an amortised encoder, variBAD enables online inference and Bayes-adaptive action selection without explicit planning at test time. The approach is evaluated on a simple gridworld, where its exploration closely matches Bayes-optimal behaviour, and on MuJoCo continuous-control tasks, where it achieves higher online return than existing meta-RL baselines. The results show that combining Bayesian RL ideas with meta-learning and variational inference yields tractable, effective Bayes-adaptive policies for deep RL.
Abstract
Trading off exploration and exploitation in an unknown environment is key to maximising expected return during learning. A Bayes-optimal policy, which does so optimally, conditions its actions not only on the environment state but also on the agent's uncertainty about the environment. Computing a Bayes-optimal policy is, however, intractable for all but the smallest tasks. In this paper, we introduce variational Bayes-Adaptive Deep RL (variBAD), a way to meta-learn to perform approximate inference in an unknown environment and to incorporate task uncertainty directly during action selection. In a grid-world domain, we illustrate how variBAD performs structured online exploration as a function of task uncertainty. We further evaluate variBAD on MuJoCo domains widely used in meta-RL and show that it achieves higher online return than existing methods.
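To make concrete the idea of conditioning actions on both the environment state and the agent's uncertainty, the following is a minimal sketch rather than the authors' implementation. It assumes a hypothetical toy gridworld in which the only unknown is which cell holds the goal and the reward is deterministic (1 at the goal, 0 elsewhere), so the exact posterior can be computed by elimination. The function names (uniform_belief, update_belief, policy_input) are illustrative only. variBAD itself replaces this exact posterior with a learned approximate posterior over a latent task embedding.

```python
# Toy illustration (hypothetical setup, not the paper's code): exact Bayesian
# belief over an unknown goal cell in a small gridworld.

def uniform_belief(n_cells):
    """Prior: every cell is equally likely to hold the goal."""
    return [1.0 / n_cells] * n_cells


def update_belief(belief, visited_cell, got_reward):
    """Posterior over the goal cell after visiting one cell.

    Assumes a deterministic reward: 1 at the goal cell, 0 elsewhere.
    """
    if got_reward:
        # Reward observed: the goal must be the visited cell.
        return [1.0 if i == visited_cell else 0.0 for i in range(len(belief))]
    # No reward: rule out the visited cell and renormalise the rest.
    posterior = [0.0 if i == visited_cell else p for i, p in enumerate(belief)]
    total = sum(posterior)
    if total == 0.0:
        raise ValueError("observation is inconsistent with the belief")
    return [p / total for p in posterior]


def policy_input(state, belief):
    """A belief-conditioned policy sees the state augmented with the belief."""
    return [float(state)] + list(belief)


# Visit two empty cells out of four; the remaining mass splits over the rest.
belief = uniform_belief(4)
belief = update_belief(belief, visited_cell=0, got_reward=False)
belief = update_belief(belief, visited_cell=2, got_reward=False)
features = policy_input(state=2, belief=belief)
```

In this sketch the policy input grows with the number of cells, which is one reason a compact learned embedding of the task is attractive in larger problems.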
