Improving Generalization in Meta Reinforcement Learning using Learned Objectives
Louis Kirsch, Sjoerd van Steenkiste, Jürgen Schmidhuber
TL;DR
MetaGenRL is an off-policy, gradient-based meta reinforcement learning framework that meta-learns a low-complexity neural objective function, which then governs how future agents learn. The objective is represented by an LSTM-based network and optimized with second-order gradients taken through a differentiable critic. A population of agents shares this single learnable objective, and off-policy data is used to credit improvements to the learning rule, which improves sample efficiency over prior meta-RL approaches. On continuous control tasks, MetaGenRL outperforms several baselines on unseen environments, including ones very different from those used for meta-training, and approaches or surpasses human-engineered RL algorithms in some settings.
Abstract
Biological evolution has distilled the experiences of many learners into the general learning algorithms of humans. Our novel meta reinforcement learning algorithm MetaGenRL is inspired by this process. MetaGenRL distills the experiences of many complex agents to meta-learn a low-complexity neural objective function that decides how future individuals will learn. Unlike recent meta-RL algorithms, MetaGenRL can generalize to new environments that are entirely different from those used for meta-training. In some cases, it even outperforms human-engineered RL algorithms. MetaGenRL uses off-policy second-order gradients during meta-training that greatly increase its sample efficiency.
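To make the phrase "second-order gradients" concrete, the following is a minimal scalar sketch, not the authors' implementation. It assumes a one-parameter linear policy a = theta * s, a hypothetical one-parameter objective L_alpha(theta) = mean((theta*s - alpha*s)^2) standing in for the LSTM objective, and a fixed, hand-written critic Q(s, a) = -(a - 2s)^2 standing in for a critic learned from off-policy data. All names and constants are illustrative assumptions. The point it shows is the chain rule: the objective parameter alpha is improved by differentiating the critic's value of the policy after one inner update, which requires the mixed derivative of the objective with respect to theta and alpha.

```python
# Toy sketch: meta-gradient for a learned objective, passed through one inner
# policy update and a differentiable critic. Everything here (policy, objective,
# critic, constants) is a hypothetical stand-in chosen so gradients are analytic.


def mean(xs):
    return sum(xs) / len(xs)


def inner_update(theta, alpha, states, lr):
    """One gradient step on L_alpha(theta) = mean((theta*s - alpha*s)^2)."""
    grad_theta = mean([2.0 * s * s * (theta - alpha) for s in states])
    return theta - lr * grad_theta


def critic(s, a):
    """Assumed fixed critic; its value is highest when a == 2*s."""
    return -((a - 2.0 * s) ** 2)


def meta_value(theta, alpha, states, lr):
    """Critic's value of the policy obtained after the inner update."""
    theta_new = inner_update(theta, alpha, states, lr)
    return mean([critic(s, theta_new * s) for s in states])


def meta_gradient(theta, alpha, states, lr):
    """d meta_value / d alpha, via the chain rule through the inner update."""
    m2 = mean([s * s for s in states])
    theta_new = inner_update(theta, alpha, states, lr)
    dvalue_dtheta_new = -2.0 * m2 * (theta_new - 2.0)
    # d theta_new / d alpha = -lr * d^2 L / (d alpha d theta): the second-order term.
    dtheta_new_dalpha = 2.0 * lr * m2
    return dvalue_dtheta_new * dtheta_new_dalpha


def meta_train(thetas, alpha, states, lr=0.1, meta_lr=0.5, steps=200):
    """A population of policies shares one objective parameter alpha."""
    thetas = list(thetas)
    for _ in range(steps):
        # Average the meta-gradient over the population, then ascend on alpha.
        g = mean([meta_gradient(t, alpha, states, lr) for t in thetas])
        alpha += meta_lr * g
        # Each agent then learns using the shared, updated objective.
        thetas = [inner_update(t, alpha, states, lr) for t in thetas]
    return thetas, alpha


if __name__ == "__main__":
    final_thetas, final_alpha = meta_train([0.0, 1.0, 3.0], 0.0, [0.5, 1.0, 1.5])
    print(final_thetas, final_alpha)
```

In this toy setting the shared objective parameter and every policy in the population settle near the action scale the assumed critic prefers. In a realistic setting the analytic derivatives would be replaced by automatic differentiation, and the critic would itself be trained from replayed experience.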
