Discovering Temporally-Aware Reinforcement Learning Algorithms

Matthew Thomas Jackson; Chris Lu; Louis Kirsch; Robert Tjarko Lange; Shimon Whiteson; Jakob Nicolaus Foerster

TL;DR

This work studies meta-learned RL algorithms that adapt to the remaining training time by conditioning their objective on lifetime information. It introduces temporally-adaptive variants of two existing methods, TA-LPG and TA-LPO, and shows that Evolution Strategies (ES) discover non-myopic update rules that gradient-based meta-learning misses. Across Grid-World, MinAtar, and Brax tasks, the temporally-aware methods generalize to unseen training horizons and out-of-distribution environments, and they exhibit dynamic learning schedules that balance exploration and exploitation. The findings indicate that optimizing over the agent’s whole lifetime is needed to uncover expressive, horizon-aware learning rules, and that ES is a key driver of this discovery. The results suggest these methods are practically useful for RL systems that must adapt their training behavior as training progresses.

Abstract

Recent advancements in meta-learning have enabled the automatic discovery of novel reinforcement learning algorithms parameterized by surrogate objective functions. To improve upon manually designed algorithms, the parameterization of this learned objective function must be expressive enough to represent novel principles of learning (instead of merely recovering already established ones) while still generalizing to a wide range of settings outside of its meta-training distribution. However, existing methods focus on discovering objective functions that, like many widely used objective functions in reinforcement learning, do not take into account the total number of steps allowed for training, or "training horizon". In contrast, humans use a plethora of different learning objectives across the course of acquiring a new ability. For instance, students may alter their studying techniques based on the proximity to exam deadlines and their self-assessed capabilities. This paper contends that ignoring the optimization time horizon significantly restricts the expressive potential of discovered learning algorithms. We propose a simple augmentation to two existing objective discovery approaches that allows the discovered algorithm to dynamically update its objective function throughout the agent's training procedure, resulting in expressive schedules and increased generalization across different training horizons. In the process, we find that commonly used meta-gradient approaches fail to discover such adaptive objective functions while evolution strategies discover highly dynamic learning rules. We demonstrate the effectiveness of our approach on a wide range of tasks and analyze the resulting learned algorithms, which we find effectively balance exploration and exploitation by modifying the structure of their learning rules throughout the agent's lifetime.
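To make the idea of a horizon-aware objective concrete, here is a minimal sketch in Python. It is not the learned objective from the paper: the meta-learned network is replaced by a hand-written, PPO-style clipped surrogate whose clip range is interpolated over training progress. The function names, the `theta` dictionary, and the choice of `n / N` and `log N` as temporal inputs are assumptions made for illustration only.

```python
import numpy as np


def lifetime_features(n, N):
    """Temporal inputs for update n of N (assumed features: n/N and log N)."""
    if N <= 0 or not 0 <= n <= N:
        raise ValueError("require N > 0 and 0 <= n <= N")
    return np.array([n / N, np.log(N)])


def temporally_aware_objective(ratio, advantage, n, N, theta):
    """Hypothetical stand-in for a learned objective conditioned on lifetime.

    A clipped surrogate whose clip range moves linearly from
    theta["eps_start"] at n = 0 to theta["eps_end"] at n = N, so the same
    (ratio, advantage) pair is scored differently early and late in training.
    """
    progress, _log_horizon = lifetime_features(n, N)  # log N unused in this toy
    eps = (1.0 - progress) * theta["eps_start"] + progress * theta["eps_end"]
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * advantage, clipped * advantage).mean()


# Hypothetical parameters: permissive updates early, conservative updates late.
theta = {"eps_start": 0.4, "eps_end": 0.1}
ratio = np.array([1.5])      # likelihood ratio pi_new / pi_old
advantage = np.array([1.0])  # advantage estimate

for n in (0, 50, 100):
    print(n, temporally_aware_objective(ratio, advantage, n, 100, theta))
```

In this toy, the objective value for an identical sample shrinks as `n` approaches `N`, which is one simple way a schedule can emerge from conditioning on training progress. In the setting the abstract describes, such behavior would be discovered by meta-optimization rather than written by hand.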

Paper Structure (38 sections, 12 equations, 9 figures, 2 tables)

Introduction
Related Work
Meta-learning fundamentals
Meta-learning objective functions
Meta-optimization with Evolutionary / Zeroth-order methods
Background
Meta Reinforcement Learning
Learned Policy Gradient (LPG)
Learned Policy Optimisation (LPO)
Method
Conditioning on Agent Lifetime
Temporally-Adaptive LPG
Temporally-Adaptive LPO
Meta-Optimization for Lifetime Adaptation
Gradient-free vs. gradient-based meta-optimization
...and 23 more sections

Figures (9)

Figure 1: TA-LPG adapts to variable training horizons. Training curves for LPG and TA-LPG on held-out Grid-World environments from Oh et al. (2020) (top) and Minigrid (bottom), for a range of total train steps. Since TA-LPG adapts to the training horizon, we plot individual training curves for each horizon (faded lines) and their final return (bold points), with the color gradient reflecting the horizon for each model. We observe the final return of TA-LPG at each horizon is consistently greater than the LPG return at the same point. Return is normalized against an A2C agent trained to convergence and averaged over 5 meta-train and 128 meta-test seeds.
Figure 2: TA-LPO leverages lifetime information and generalizes to a wide range of environments. Results of TA-LPO, LPO and PPO on the Brax and MinAtar suites across three seeds. TA-LPO was only meta-trained on SpaceInvaders-MinAtar. We provide complete training curves in \ref{sec:in-depth-lpo}.
Figure 3: Lifetime conditioning enables adaptation to training horizon. Policy entropy and update norm of TA-LPG and LPG over randomly-sampled Grid-Worlds.
Figure 4: TA-LPO learns to switch from optimism to pessimism. A visualization of the derivative of the TA-LPO objective at the beginning ($n=0$, left), middle ($n=N/2$, center), and end ($n=N$, right) of the training lifetime. The objective at $n=0$ appears to be optimistic and maximize entropy while the objective at $n=N$ appears to be pessimistic and minimize entropy.
Figure 5: Meta-gradient TA-LPG fails to adapt to temporal information. Left: Final return of TA-LPG trained with ES and meta-gradients on the meta-training distribution of randomly sampled Grid-Worlds. Individual training curves are plotted for each horizon (faint lines), but are indistinguishable for meta-gradient TA-LPG due to lack of horizon adaptation. Right: Policy update norm of the meta-gradient TA-LPG model over inner loop training, for a range of training horizons. We observe lower meta-gradient performance across horizons and no consistent sign of horizon adaptation in update norm.
...and 4 more figures
