Can Learned Optimization Make Reinforcement Learning Less Difficult?

Alexander David Goldie, Chris Lu, Matthew Thomas Jackson, Shimon Whiteson, Jakob Nicolaus Foerster

TL;DR

This work tackles three core RL difficulties (non-stationarity, plasticity loss, and exploration) by meta-learning a flexible, input-conditioned update rule called OPEN. OPEN uses a GRU-based architecture to generate per-parameter updates through a three-stage expression: a raw update $\hat{u}_i = \alpha_1 m_i \exp(\alpha_2 e_i)$; a learnable stochastic term added to the actor's update to drive exploration, $\hat{u}^{actor,new}_i = \hat{u}^{actor}_i + \alpha_3 \delta^{actor}_i \epsilon$; and a zero-mean correction that yields the final update $u_i$ for stability. Parameters are then updated as $p_i^{(t+1)} = p_i^{(t)} - u_i$. OPEN is trained with Evolution Strategies on distributions of RL tasks and evaluated in single-task, multi-task, in-distribution, and out-of-support generalization settings across MinAtar and gridworld domains. It outperforms or matches handcrafted optimizers and other learned optimizers, and transfers zero-shot to Craftax-Classic. Ablation studies show that each design choice (training-timescale inputs, layer-proportion cues, dormancy signals, and stochasticity) contributes to performance, particularly for exploration and for handling plasticity loss. The results indicate OPEN's potential as a foundation-like learned optimizer for RL that generalizes across architectures and algorithms, albeit with substantial compute requirements, and with curricula and scalability left to future work.
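To make the update expression above concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `open_style_update` is hypothetical, the per-parameter values `m`, `e`, and `delta` are passed in directly rather than produced by a GRU, $\epsilon$ is assumed to be standard normal noise, and the zero-mean correction is assumed to subtract the mean over the parameters passed in.

```python
import math
import random


def open_style_update(params, m, e, delta, alphas, is_actor, rng):
    """Apply one OPEN-style step to a flat list of parameters.

    m, e, delta: per-parameter optimizer outputs (hypothetical stand-ins
    for what the learned GRU-based network would produce).
    alphas: the scalars (alpha_1, alpha_2, alpha_3) from the TL;DR formula.
    """
    a1, a2, a3 = alphas

    # Raw update: u_hat_i = alpha_1 * m_i * exp(alpha_2 * e_i)
    u_hat = [a1 * mi * math.exp(a2 * ei) for mi, ei in zip(m, e)]

    # Actor only: add the stochastic term alpha_3 * delta_i * eps.
    # Assumption: eps is drawn independently per parameter from N(0, 1).
    if is_actor:
        u_hat = [ui + a3 * di * rng.gauss(0.0, 1.0)
                 for ui, di in zip(u_hat, delta)]

    # Zero-mean correction (assumed: subtract the mean over this group).
    mean_u = sum(u_hat) / len(u_hat)
    u = [ui - mean_u for ui in u_hat]

    # Parameter update: p_i <- p_i - u_i
    new_params = [p - ui for p, ui in zip(params, u)]
    return new_params, u


rng = random.Random(0)
critic_params, critic_u = open_style_update(
    [1.0, 2.0], m=[1.0, 3.0], e=[0.0, 0.0], delta=[1.0, 1.0],
    alphas=(0.1, 1.0, 0.0), is_actor=False, rng=rng,
)
```

Because of the zero-mean step, the returned updates sum to zero within each group, with or without the actor noise.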

Abstract

While reinforcement learning (RL) holds great potential for decision making in the real world, it suffers from a number of unique difficulties which often need specific consideration. In particular: it is highly non-stationary; suffers from high degrees of plasticity loss; and requires exploration to prevent premature convergence to local optima and maximize return. In this paper, we consider whether learned optimization can help overcome these problems. Our method, Learned Optimization for Plasticity, Exploration and Non-stationarity (OPEN), meta-learns an update rule whose input features and output structure are informed by previously proposed solutions to these difficulties. We show that our parameterization is flexible enough to enable meta-learning in diverse learning contexts, including the ability to use stochasticity for exploration. Our experiments demonstrate that when meta-trained on single and small sets of environments, OPEN outperforms or equals traditionally used optimizers. Furthermore, OPEN shows strong generalization characteristics across a range of environments and agent architectures.
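The TL;DR notes that OPEN is meta-trained with Evolution Strategies. As a rough illustration of that outer loop, the sketch below performs antithetic ES steps on a toy objective. Everything here is an assumption for illustration: `es_step` and `toy_fitness` are hypothetical names, the population size, noise scale, and learning rate are arbitrary, and `toy_fitness` stands in for the final return of an inner RL loop run with the candidate optimizer.

```python
import random


def es_step(theta, fitness, rng, pop_size=16, sigma=0.1, lr=0.05):
    """One antithetic ES step that ascends `fitness` without gradients."""
    grad = [0.0] * len(theta)
    for _ in range(pop_size // 2):
        # Sample one perturbation and evaluate it in both directions.
        noise = [rng.gauss(0.0, 1.0) for _ in theta]
        plus = fitness([t + sigma * n for t, n in zip(theta, noise)])
        minus = fitness([t - sigma * n for t, n in zip(theta, noise)])
        for i, n in enumerate(noise):
            grad[i] += (plus - minus) * n
    # Each antithetic pair contributes (f+ - f-) * noise / (2 * sigma);
    # averaging over pop_size / 2 pairs gives the 1 / (pop_size * sigma) factor.
    scale = lr / (pop_size * sigma)
    return [t + scale * g for t, g in zip(theta, grad)]


def toy_fitness(theta):
    """Hypothetical stand-in for an inner-loop return; maximized at 3."""
    return -sum((t - 3.0) ** 2 for t in theta)


rng = random.Random(0)
theta = [0.0, 0.0]  # stand-in for the learned optimizer's meta-parameters
for _ in range(200):
    theta = es_step(theta, toy_fitness, rng)
```

In a real meta-training run, each fitness evaluation would be an entire RL training run, which is one reason the compute cost is substantial.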

Paper Structure

This paper contains 79 sections, 9 equations, 20 figures, 14 tables, 1 algorithm.

Figures (20)

  • Figure 1: A visualization of OPEN. We train $N$ agents, replacing the handcrafted optimizer of the RL loop with ones sampled from the meta-learner (i.e., evolution). Each optimizer conditions on gradient, momentum and additional inputs, detailed in Section \ref{sec:features}, to calculate updates. The final returns from each loop are output to the meta-learner, which improves the optimizer before repeating the process. A single inner loop step is described algorithmically in Appendix \ref{app:OPENAlgo}.
  • Figure 2: IQM of final returns for the five single-task training environments, evaluated over 16 random environment seeds. We plot 95% stratified bootstrap confidence intervals for each environment.
  • Figure 3: Mean, IQM and optimality gap (smaller = better), evaluated over 16 random seeds per environment for the aggregated, Adam-normalized final returns after multi-task training on MinAtar \cite{young19minatar, gymnax2022github}. We plot 95% stratified bootstrap confidence intervals for each metric.
  • Figure 4: IQM of return, normalized by Adam, in seven gridworlds, with 95% stratified bootstrap confidence intervals for 64 random seeds. On the left, we show performance on the distribution in which OPEN and Adam were trained and tuned. On the right, we show OOS performance: the top row shows gridworlds from \cite{ohDiscoveringReinforcementLearning2021b}, and the bottom row shows mazes from \cite{chevalier-boisvertMinigridMiniworldModular2023}. We mark $\text{Hidden Size}=16$ as the in-distribution agent size for OPEN and Adam.
  • Figure 5: A comparison of OPEN and Adam with and without hyperparameter tuning in Craftax-Classic. We plot mean return over 32 seeds. Standard error is negligible (< 0.06).
  • ...and 15 more figures
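
The figure captions above report IQM, optimality gap, and 95% stratified bootstrap confidence intervals over Adam-normalized returns. The sketch below shows one common way to compute these quantities. It rests on assumptions rather than the paper's evaluation code: the function names are hypothetical, the IQM is taken as a 25% trimmed mean, the optimality gap uses a threshold of 1.0 (parity with Adam), and the interval is a plain percentile bootstrap that resamples runs within each environment.

```python
import random


def normalize_by_baseline(scores, baseline_score):
    """Divide raw returns by a baseline return (e.g. Adam's)."""
    return [s / baseline_score for s in scores]


def iqm(scores):
    """Interquartile mean: the mean of the middle 50% of sorted scores."""
    s = sorted(scores)
    cut = len(s) // 4  # drop this many runs from each end
    middle = s[cut:len(s) - cut]
    return sum(middle) / len(middle)


def optimality_gap(scores, threshold=1.0):
    """Mean shortfall below `threshold`; smaller is better."""
    return sum(max(threshold - s, 0.0) for s in scores) / len(scores)


def stratified_bootstrap_ci(scores_by_env, stat, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI, resampling runs within each environment."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        pooled = []
        for runs in scores_by_env.values():
            # Stratified: each environment keeps its own number of runs.
            pooled.extend(rng.choices(runs, k=len(runs)))
        estimates.append(stat(pooled))
    estimates.sort()
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return estimates[lo_idx], estimates[hi_idx]


# Made-up normalized scores for two hypothetical environments.
example = {"env_a": [0.8, 1.1, 1.3, 0.9], "env_b": [1.0, 1.2, 0.7, 1.4]}
low, high = stratified_bootstrap_ci(example, iqm)
```

The IQM discards the top and bottom quarter of runs, so a single outlier seed does not dominate the aggregate the way it can with the mean.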