Improving Generalization in Meta Reinforcement Learning using Learned Objectives

Louis Kirsch, Sjoerd van Steenkiste, Jürgen Schmidhuber

TL;DR

MetaGenRL presents a novel off-policy, gradient-based meta-learning framework that meta-learns a low-complexity neural objective to shape how future agents learn. By representing the objective with an LSTM-based network and optimizing via second-order gradients through a differentiable critic, it achieves strong generalization to environments vastly different from meta-training and improves sample efficiency over prior meta-RL approaches. The approach relies on a population of agents sharing a single learnable objective and leverages off-policy data to credit improvements in learning rules, enabling rapid adaptation at test time. Empirical results on continuous control tasks show MetaGenRL outperforming several baselines on unseen environments and approaching or surpassing human-engineered methods in some settings.

Abstract

Biological evolution has distilled the experiences of many learners into the general learning algorithms of humans. Our novel meta reinforcement learning algorithm MetaGenRL is inspired by this process. MetaGenRL distills the experiences of many complex agents to meta-learn a low-complexity neural objective function that decides how future individuals will learn. Unlike recent meta-RL algorithms, MetaGenRL can generalize to new environments that are entirely different from those used for meta-training. In some cases, it even outperforms human-engineered RL algorithms. MetaGenRL uses off-policy second-order gradients during meta-training that greatly increase its sample efficiency.

Paper Structure

This paper contains 46 sections, 8 equations, 13 figures, 3 tables, 1 algorithm.

Figures (13)

  • Figure 1: A schematic of MetaGenRL. On the left is a population of agents ($i \in \{1, \ldots, N\}$), where each member consists of a critic $Q^{(i)}_\theta$ and a policy $\pi^{(i)}_\phi$ that interact with a particular environment $e^{(i)}$ and store collected data in a corresponding replay buffer $B^{(i)}$. On the right is a meta-learned neural objective function $L_\alpha$ that is shared across the population. Learning (dotted arrows) proceeds as follows: each policy is updated by differentiating $L_\alpha$, while the critic is updated using the usual TD-error (not shown). $L_\alpha$ is meta-learned by computing second-order gradients that can be obtained by differentiating through the critic.
  • Figure 2: An overview of $L_\alpha(\tau, x(\phi), V)$.
  • Figure 3: Comparing the test-time training behavior of the meta-learned objective functions by MetaGenRL to other (meta) reinforcement learning algorithms. We train randomly initialized agents on (a) environments that were encountered during training, and (b) on significantly different environments that were unseen. Training environments are denoted by $\dagger$ in the legend. All runs are shown with mean and standard deviation computed over multiple random seeds (MetaGenRL: 6 meta-train $\times$ 2 meta-test seeds, RL$^2$: 6 meta-train $\times$ 2 meta-test seeds, EPG: 3 meta-train $\times$ 2 meta-test seeds, and 6 seeds for all others).
  • Figure 4: Meta-training with 20 agents on Cheetah and Lunar. We test the objective function at five stages of meta-training by using it to train three randomly initialized agents on Hopper.
  • Figure 5: (a) We meta-train MetaGenRL using several alternative parametrizations of $L_\alpha$ on Lunar and Cheetah, and (b) present results of testing on Cheetah. During meta-training, a representative example of a single agent population is shown, with shaded regions denoting standard deviation across the population. Meta-test results are reported as usual across 6 meta-train $\times$ 2 meta-test seeds.
  • ...and 8 more figures
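
The learning loop described in the Figure 1 caption can be illustrated with a deliberately tiny example. The following is a minimal sketch under strong assumptions, not the paper's implementation: the policy is a single scalar `phi` with action `a = phi * s`, the learned objective is a hypothetical quadratic `0.5 * (a - alpha * s)**2` instead of an LSTM-based network, and the critic is a fixed analytic function `Q(s, a) = -(a - 2s)^2` rather than one trained with TD-error. All function names and constants are illustrative. What the sketch does show is the core mechanism: the policy is updated by differentiating the objective, and the objective parameter `alpha` is updated with a gradient that flows through the critic and through that policy update.

```python
# Toy sketch of meta-learning an objective through a differentiable critic.
# Every name and functional form here is an illustrative assumption.


def learned_objective_grad(phi, alpha, s):
    """Gradient w.r.t. phi of the hypothetical objective
    L_alpha = 0.5 * (phi * s - alpha * s) ** 2."""
    return (phi * s - alpha * s) * s


def policy_update(phi, alpha, s, lr):
    """One gradient step on the policy parameter using the learned objective."""
    return phi - lr * learned_objective_grad(phi, alpha, s)


def critic(s, a):
    """Fixed stand-in critic Q(s, a) = -(a - 2s)^2; the best action is a = 2s.
    (In the paper the critic is learned; here it is assumed to be given.)"""
    return -(a - 2.0 * s) ** 2


def meta_gradient(phi, alpha, s, lr):
    """d Q(s, pi_{phi'}(s)) / d alpha, where phi' is the updated policy.

    Chain rule: dQ/da * da/dphi' * dphi'/dalpha. The last factor equals
    -lr * d^2 L / (dphi dalpha), which is where second-order terms enter.
    """
    phi_new = policy_update(phi, alpha, s, lr)
    a_new = phi_new * s
    dq_da = -2.0 * (a_new - 2.0 * s)   # derivative of the critic w.r.t. action
    da_dphi = s                        # derivative of the action w.r.t. phi'
    dphi_dalpha = lr * s * s           # derivative of the update w.r.t. alpha
    return dq_da * da_dphi * dphi_dalpha


def meta_train(steps=300, lr=0.5, meta_lr=0.5):
    """Alternate policy updates and gradient-ascent steps on alpha."""
    phi, alpha = 0.0, 0.0
    states = (1.0, -1.0, 0.5)  # fixed cycle of states, standing in for a replay buffer
    for t in range(steps):
        s = states[t % len(states)]
        g = meta_gradient(phi, alpha, s, lr)     # uses the current phi and alpha
        phi = policy_update(phi, alpha, s, lr)   # policy follows the objective
        alpha += meta_lr * g                     # objective improves critic value
    return phi, alpha


if __name__ == "__main__":
    phi, alpha = meta_train()
    print(f"phi={phi:.4f}, alpha={alpha:.4f}")
```

In this toy setup both `phi` and `alpha` approach 2, the slope the stand-in critic prefers. The derivatives are written out by hand only to keep the example dependency-free; with a neural objective and a learned critic, an automatic differentiation library would normally compute the same chain of derivatives.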