A Tutorial on Meta-Reinforcement Learning

Jacob Beck, Risto Vuorio, Evan Zheran Liu, Zheng Xiong, Luisa Zintgraf, Chelsea Finn, Shimon Whiteson

TL;DR

The paper surveys meta-reinforcement learning (meta-RL), addressing how to learn RL algorithms themselves to achieve faster adaptation across task distributions. It systematically categorizes problem settings into few-shot and many-shot regimes, multi-task vs single-task scenarios, and surveys three core inner-loop parameterizations (parameterized policy gradients, black-box sequence models, and task inference). Canonical methods like MAML and RL^2 are introduced, along with extensions covering exploration strategies, supervision regimes, and model-based variants, as well as theoretical analyses via Bayes-adaptive and other frameworks. The authors discuss applications in robotics and multi-agent RL, and identify open problems including generalization to broader task distributions, benchmarks, and the integration of offline data. The goal is to guide practitioners and researchers toward robust, generalizable meta-RL methods and to chart directions for future work.

Abstract

While deep reinforcement learning (RL) has fueled multiple high-profile successes in machine learning, it is held back from more widespread adoption by its often poor data efficiency and the limited generality of the policies it produces. A promising approach for alleviating these limitations is to cast the development of better RL algorithms as a machine learning problem itself in a process called meta-RL. Meta-RL is most commonly studied in a problem setting where, given a distribution of tasks, the goal is to learn a policy that is capable of adapting to any new task from the task distribution with as little data as possible. In this survey, we describe the meta-RL problem setting in detail as well as its major variations. We discuss how, at a high level, meta-RL research can be clustered based on the presence of a task distribution and the learning budget available for each individual task. Using these clusters, we then survey meta-RL algorithms and applications. We conclude by presenting the open problems on the path to making meta-RL part of the standard toolbox for a deep RL practitioner.

Paper Structure

This paper contains 90 sections, 35 equations, 22 figures, 13 tables, and 10 algorithms.

Figures (22)

  • Figure 1: Example of the fast adaptation meta-RL problem setting discussed in Section \ref{sec:fast_adaptation}. The agent (A) is meta-trained on a distribution of meta-training tasks to learn to go to a goal position (X) located on a unit circle around its starting position (a). At meta-test time, the agent can adapt quickly (within a handful of episodes) to new tasks with initially unknown goal positions (b). In contrast, a standard RL algorithm may need hundreds of thousands of environment interactions when trained from scratch on one such task.
  • Figure 2: The relationship between the inner-loop and outer-loop in a meta-RL algorithm. The policy for MDP $i$, parameterized by $\phi^i$, produces the meta-trajectory data $\mathcal{D}^i$. The inner-loop $f_\theta$ computes adapted policy parameters for each MDP based on the meta-trajectory during the policy's interaction with the MDP. To compute the adapted parameters, the inner-loop can use all data collected in the MDP so far. The outer-loop computes updated meta-parameters using all of the meta-trajectories collected in all of the MDPs.
  • Figure 3: Meta-training consists of trials (or lifetimes), each broken up into multiple episodes from a single task (MDP). In this example, each trial consists of two episodes ($H=2$).
  • Figure 4: MAML in the problem setting (left) and conceptually (right). The meta-parameters $\theta$ are the initial parameters of the inner-loop policies $\phi_0$. The inner-loop computes new parameters $\phi_1^i$ adapted to task $i$ using one step of a policy gradient algorithm. The outer-loop updates the meta-parameters, from $\phi_0$ to $\phi_0'$, to optimize the performance of the policies after adaptation.
  • Figure 5: RL$^2$ in the problem setting (left) and conceptually (right). The inner-loop algorithm is implemented by an RNN parameterized by the meta-parameters $\theta$. The RNN takes as input the states, actions, and rewards from the environment. The RNN hidden state $\phi_t$ defines the task parameters at each timestep, which are passed as input to the MLP policy. The hidden state is not reset during a trial and instead carries over across episode boundaries. The outer-loop is a standard RL algorithm.
  • ...and 17 more figures
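
The inner-loop and outer-loop structure described in the MAML caption (Figure 4) can be illustrated with a minimal sketch in the spirit of the unit-circle goals of Figure 1. This is not the paper's implementation, and it rests on loudly stated assumptions: the "policy" is just a 2D point $\phi$, the return of task $i$ is the hypothetical quadratic $-\lVert \phi - g_i \rVert^2$, and the exact gradient of that return stands in for a sampled policy-gradient estimate. The names `task_return`, `inner_loop`, `meta_gradient`, and `meta_train`, and the step sizes `alpha` and `beta`, are illustrative choices only.

```python
import numpy as np

def task_return(phi, goal):
    """Toy return for one task (assumption): negative squared distance to the goal."""
    return -np.sum((phi - goal) ** 2)

def return_grad(phi, goal):
    """Exact gradient of task_return w.r.t. phi (stands in for a policy gradient)."""
    return -2.0 * (phi - goal)

def inner_loop(phi0, goal, alpha):
    """Inner loop: one gradient-ascent step from the shared initialization phi0."""
    return phi0 + alpha * return_grad(phi0, goal)

def meta_gradient(phi0, goals, alpha):
    """Gradient of the mean post-adaptation return w.r.t. the initialization phi0."""
    # For this quadratic return, d(phi1)/d(phi0) = (1 - 2 * alpha) * I,
    # so the chain rule through the inner step is a scalar multiplication.
    jac = 1.0 - 2.0 * alpha
    grads = [jac * return_grad(inner_loop(phi0, g, alpha), g) for g in goals]
    return np.mean(grads, axis=0)

def meta_train(goals, alpha=0.25, beta=0.5, steps=200):
    """Outer loop: gradient ascent on phi0 to improve returns after adaptation."""
    phi0 = np.array([1.0, 1.0])  # arbitrary starting meta-parameters
    for _ in range(steps):
        phi0 = phi0 + beta * meta_gradient(phi0, goals, alpha)
    return phi0

# Meta-training tasks: eight goals evenly spaced on the unit circle.
angles = np.linspace(0.0, 2.0 * np.pi, num=8, endpoint=False)
goals = np.stack([np.cos(angles), np.sin(angles)], axis=1)

phi0 = meta_train(goals)                       # learned initialization
phi1 = inner_loop(phi0, goals[0], alpha=0.25)  # adapted to one task

print("initialization:", phi0)
print("return before adaptation:", task_return(phi0, goals[0]))
print("return after adaptation:", task_return(phi1, goals[0]))
```

In this toy setting the learned initialization settles near the centre of the circle, from which a single inner-loop step moves toward whichever goal the task specifies. A real meta-RL implementation would estimate both gradients from sampled trajectories rather than from a known return function.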