Meta-learning of Sequential Strategies

Pedro A. Ortega, Jane X. Wang, Mark Rowland, Tim Genewein, Zeb Kurth-Nelson, Razvan Pascanu, Nicolas Heess, Joel Veness, Alex Pritzel, Pablo Sprechmann, Siddhant M. Jayakumar, Tom McGrath, Kevin Miller, Mohammad Azar, Ian Osband, Neil Rabinowitz, András György, Silvia Chiappa, Simon Osindero, Yee Whye Teh, Hado van Hasselt, Nando de Freitas, Matthew Botvinick, Shane Legg

TL;DR

The paper reviews memory-based meta-learning as a practical, sample-efficient approach to building sequence-learning agents that generalize across a class of tasks. It recasts meta-learning in a Bayesian framework, showing that meta-learned strategies amortize Bayes-filtered data, with adaptation implemented in the memory dynamics as a state machine of sufficient statistics; this effectively converts probabilistic sequential inference into regression. Three concrete templates are presented: sequential predictors, Thompson-sampling agents, and Bayes-optimal decision-makers, each trained with a Monte Carlo objective to achieve near-optimal performance on broad task classes. Alongside the potential gains in sample efficiency and scalability, the work discusses substantial challenges, including meta-training cost, task-structure design, and continual-learning considerations for real-world deployment.

Abstract

In this report we review memory-based meta-learning as a tool for building sample-efficient strategies that learn from past experience to adapt to any task within a target class. Our goal is to equip the reader with the conceptual foundations of this tool for building new, scalable agents that operate on broad domains. To do so, we present basic algorithmic templates for building near-optimal predictors and reinforcement learners which behave as if they had a probabilistic model that allowed them to efficiently exploit task structure. Furthermore, we recast memory-based meta-learning within a Bayesian framework, showing that the meta-learned strategies are near-optimal because they amortize Bayes-filtered data, where the adaptation is implemented in the memory dynamics as a state-machine of sufficient statistics. Essentially, memory-based meta-learning translates the hard problem of probabilistic sequential inference into a regression problem.

Paper Structure

This paper contains 24 sections, 36 equations, 6 figures, 1 table, 3 algorithms.

Figures (6)

  • Figure 1: Basic computation graph for meta-learning a trajectory predictor. The loss function depends only on the trajectory $\tau$, not on the parameter $\theta$. Thus, the strategy $\pi$ must marginalize over the latent parameter $\theta$.
  • Figure 2: Computation graph for meta-learning a sequential prediction strategy. The agent function $f$ generates a prediction $\pi_{t}$ of the observation $x_{t}$ based on the past, captured in the last observation $x_{t-1}$ and the state $m_{t-1}$. The top diagram illustrates the computation graph for a whole sequence (of length $T=4$), while the lower diagram shows a detailed view of a single step computation.
  • Figure 3: Minimal state machine for a predictor of coin tosses with a fixed, unknown bias. The hypothesis class can be modeled as a 2-sided coin (see Example \ref{exa:hypothesis-dice-roll}). Dark and light state transitions correspond to observing the outcomes 'Head' and 'Tail' respectively, and the states are annotated with $(n_\text{H},n_\text{T})$, the number of times Head and Tail have been observed. The predictions made from each state are shown in the stacked bar charts: the probability of Head is $P(x_t=\text{H}|x_{<t})=\frac{n_\text{H}+1}{t+2}$ (which is how these predictions are implemented in a computer program). Note how different observation sequences can lead to the same state (e.g. HT and TH).
  • Figure 4: Meta-learned state machine for a predictor of coin tosses. The figure shows the memory dynamics of a standard memory-based predictor projected onto the first two eigenvectors. Notice the striking similarity with Figure 3. The predictor, which consists of 20 LSTM cells with softmax predictions, was trained using Algorithm \ref{alg:prediction} on 1000 batches of 100 rollouts, where each rollout had length 10. For training, we used the Adam optimization algorithm (Kingma & Ba, 2014).
  • Figure 5: Detail of the computation graph for meta-learning a Thompson sampling agent. The actions $a_t$ are generated from the agent's policy $\pi_t$. The expert policy $\pi^\ast_t$ is only used for generating a loss signal $\ell_t$.
  • ...and 1 more figure
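
The count-based state machine described in the Figure 3 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `step`, `predict_head`, and `run` are hypothetical, and the sketch assumes a uniform prior over the coin bias (Laplace's rule of succession), writing the denominator as $n_\text{H}+n_\text{T}+2$, that is, reading $t$ in the caption as the number of observations seen so far.

```python
from fractions import Fraction

def step(state, outcome):
    """Advance the count state (n_H, n_T) after observing 'H' or 'T'."""
    n_h, n_t = state
    if outcome == "H":
        return (n_h + 1, n_t)
    if outcome == "T":
        return (n_h, n_t + 1)
    raise ValueError(f"unknown outcome: {outcome!r}")

def predict_head(state):
    """Posterior predictive probability of Head from the counts.

    Assumes a uniform prior over the bias, which gives Laplace's rule:
    (n_H + 1) / (n_H + n_T + 2).
    """
    n_h, n_t = state
    return Fraction(n_h + 1, n_h + n_t + 2)

def run(sequence, state=(0, 0)):
    """Feed a sequence of outcomes through the state machine."""
    for outcome in sequence:
        state = step(state, outcome)
    return state

# Different observation orders collapse onto the same state,
# so the counts act as a sufficient statistic for prediction.
print(run("HT"), run("TH"))        # (1, 1) (1, 1)
print(predict_head(run("HH")))     # 3/4
```

Because the prediction depends on the past only through the pair of counts, the state machine needs no memory of the order of observations; a meta-learned recurrent predictor would have to encode equivalent information in its memory state.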

Theorems & Definitions (2)

  • Example 1
  • Example 2