VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning

Luisa Zintgraf, Kyriacos Shiarlis, Maximilian Igl, Sebastian Schulze, Yarin Gal, Katja Hofmann, Shimon Whiteson

TL;DR

This paper addresses exploration under task uncertainty in deep reinforcement learning by introducing variBAD, a meta-learned variational framework that infers a low-dimensional embedding of the unknown task and conditions the policy on the posterior over this embedding. An amortised encoder is trained jointly with a generative model of environment dynamics and rewards, which enables online inference and approximately Bayes-adaptive action selection without explicit planning at test time. The approach is evaluated on a gridworld and on MuJoCo continuous-control tasks: in the gridworld, variBAD closely matches Bayes-optimal exploration behaviour, and on MuJoCo it achieves higher online return than existing meta-RL baselines. The results suggest that combining Bayesian RL ideas with meta-learning and variational inference yields tractable, effective Bayes-adaptive policies for deep RL.

Abstract

Trading off exploration and exploitation in an unknown environment is key to maximising expected return during learning. A Bayes-optimal policy, which does so optimally, conditions its actions not only on the environment state but also on the agent's uncertainty about the environment. Computing a Bayes-optimal policy is, however, intractable for all but the smallest tasks. In this paper, we introduce variational Bayes-Adaptive Deep RL (variBAD), a way to meta-learn to perform approximate inference in an unknown environment, and incorporate task uncertainty directly during action selection. In a grid-world domain, we illustrate how variBAD performs structured online exploration as a function of task uncertainty. We further evaluate variBAD on MuJoCo domains widely used in meta-RL and show that it achieves higher online return than existing methods.
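The data flow described in the abstract (infer a belief about the task online, then condition action selection on that belief) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class `TinyRecurrentEncoder`, the helper `policy_input`, the plain tanh recurrence, the random untrained weights, and the choice of a diagonal Gaussian belief summarised by a mean and a log-variance. It is not the authors' implementation, and the decoder and all training are omitted.

```python
import math
import random


class TinyRecurrentEncoder:
    """Toy stand-in for a recurrent encoder of q(m | tau_{:t}).

    Hypothetical and untrained: weights are random, and the belief is
    assumed to be a diagonal Gaussian given by (mean, logvar).
    """

    def __init__(self, input_dim, hidden_dim, latent_dim, seed=0):
        rng = random.Random(seed)

        def matrix(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        self.w_in = matrix(hidden_dim, input_dim)
        self.w_rec = matrix(hidden_dim, hidden_dim)
        self.w_mean = matrix(latent_dim, hidden_dim)
        self.w_logvar = matrix(latent_dim, hidden_dim)
        self.hidden = [0.0] * hidden_dim

    @staticmethod
    def _matvec(matrix, vector):
        return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

    def step(self, state, action, reward):
        """Fold one transition into the hidden state and return (mean, logvar)."""
        x = list(state) + list(action) + [reward]
        from_input = self._matvec(self.w_in, x)
        from_hidden = self._matvec(self.w_rec, self.hidden)
        self.hidden = [math.tanh(a + b) for a, b in zip(from_input, from_hidden)]
        mean = self._matvec(self.w_mean, self.hidden)
        logvar = self._matvec(self.w_logvar, self.hidden)
        return mean, logvar


def policy_input(state, mean, logvar):
    """Augment the environment state with the belief parameters."""
    return list(state) + list(mean) + list(logvar)


# One online step: a 2-D state, a 1-D action, and a scalar reward.
encoder = TinyRecurrentEncoder(input_dim=4, hidden_dim=8, latent_dim=3)
mean, logvar = encoder.step(state=[0.0, 1.0], action=[1.0], reward=0.0)
features = policy_input([0.0, 1.0], mean, logvar)
```

The point of the sketch is only the interface: the policy receives the state together with a summary of the current task belief, so uncertainty can influence action selection without planning at test time.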

Paper Structure

This paper contains 24 sections, 10 equations, 8 figures.

Figures (8)

  • Figure 1: Illustration of different exploration strategies. (a) Environment: The agent starts at the bottom left and has to navigate to an unknown goal, located in the grey area. (b) A Bayes-optimal exploration strategy that systematically searches possible grid cells to find the goal, shown in solid (past actions) and dashed (future actions) blue lines. A simplified posterior is shown in the background in grey ($p=1/(\text{number of possible goal positions left})$ of containing the goal) and white ($p=0$). (c) Posterior sampling, which repeatedly samples a possible goal position (red squares) from the current posterior, takes the shortest route there, and updates its posterior. (d) Exploration strategy learned by variBAD. The grey background represents the approximate posterior the agent has learned. (e) Average return over all possible environments, over six episodes with 15 steps each (after which the agent is reset to the starting position). VariBAD results are averaged across $20$ random seeds. The performance of any exploration strategy is bounded above by the optimal behaviour (of a policy with access to the true goal position). The Bayes-optimal agent matches this behaviour from the second episode, whereas posterior sampling needs six rollouts. VariBAD closely approximates Bayes-optimal behaviour in this environment.
  • Figure 2: VariBAD architecture: A trajectory of states, actions and rewards is processed online using an RNN to produce the posterior over task embeddings, $q_{\phi}(m|\tau_{:t})$. The posterior is trained using a decoder which attempts to predict past and future states and rewards from current states and actions. The policy conditions on the posterior in order to act in the environment and is trained using RL.
  • Figure 3: Behaviour of variBAD in the gridworld environment. (a) Hand-picked but representative example test rollout. The blue background indicates the posterior probability of receiving a reward at that cell. (b) Probability of receiving a reward for each cell, as predicted by the decoder, over the course of interacting with the environment (average in black, goal state in green). (c) Visualisation of the latent space; each line is one latent dimension, the black line is the average.
  • Figure 4: Average test performance for the first $5$ rollouts of MuJoCo environments (using $5$ seeds).
  • Figure 5: Results for the gridworld toy environment. Results are averages over 20 seeds (with $95\%$ confidence intervals for the learning curve).
  • ...and 3 more figures
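
The simplified posterior in the Figure 1 caption, where each remaining candidate cell has probability $1/(\text{number of possible goal positions left})$ and ruled-out cells have probability $0$, can be written as a short exact update. This is a sketch under stated assumptions: the function names `uniform_belief` and `update_belief` are hypothetical, cells are represented as `(row, column)` tuples, and the agent is assumed to learn with certainty whether a visited cell contains the goal.

```python
from fractions import Fraction


def uniform_belief(candidate_cells):
    """Uniform prior over the cells that may contain the goal."""
    cells = list(dict.fromkeys(candidate_cells))  # drop duplicates, keep order
    return {cell: Fraction(1, len(cells)) for cell in cells}


def update_belief(belief, visited_cell, found_goal):
    """Condition the belief on the outcome of visiting one cell."""
    if found_goal:
        if belief.get(visited_cell, 0) == 0:
            raise ValueError("goal found in a cell that had zero probability")
        # All probability mass collapses onto the visited cell.
        return {cell: Fraction(int(cell == visited_cell)) for cell in belief}
    # Otherwise the visited cell is ruled out and the rest stay uniform.
    remaining = {cell for cell, p in belief.items()
                 if p > 0 and cell != visited_cell}
    if not remaining:
        raise ValueError("every candidate cell has been ruled out")
    share = Fraction(1, len(remaining))
    return {cell: (share if cell in remaining else Fraction(0))
            for cell in belief}


# Four candidate cells; the agent visits one and finds no goal there.
belief = uniform_belief([(0, 0), (0, 1), (1, 0), (1, 1)])
belief = update_belief(belief, (0, 0), found_goal=False)
```

After the update, the visited cell has probability 0 and each of the three remaining cells has probability 1/3, matching the grey and white shading described in the caption. The belief learned by variBAD in panel (d) is an approximation produced by the encoder, not this exact rule.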