Can Learned Optimization Make Reinforcement Learning Less Difficult?
Alexander David Goldie, Chris Lu, Matthew Thomas Jackson, Shimon Whiteson, Jakob Nicolaus Foerster
TL;DR
This work tackles three core RL difficulties (non-stationarity, plasticity loss, and exploration) by meta-learning a flexible, input-conditioned update rule called OPEN. OPEN uses a GRU-based architecture to generate per-parameter updates in three stages. First, a raw update is computed as $\hat{u}_i = \alpha_1 m_i \exp(\alpha_2 e_i)$. Second, for the actor, a learnable stochastic term is added to drive exploration: $\hat{u}^{actor,new}_i = \hat{u}^{actor}_i + \alpha_3 \delta^{actor}_i \epsilon$. Third, the updates are zero-meaned for stability to give $u_i$, and the parameters are updated as $p_i^{(t+1)} = p_i^{(t)} - u_i$. OPEN is trained with Evolution Strategies on distributions of RL tasks and evaluated in single-task, multi-task, in-distribution, and out-of-support generalization settings across MinAtar and gridworld domains. It outperforms or matches handcrafted optimizers and other learned optimizers, and it shows zero-shot transfer to Craftax-Classic. Ablation studies show that each design choice (training-timescale inputs, layer-proportion cues, dormancy signals, and stochasticity) contributes to performance, particularly in exploration and in handling plasticity loss. The results indicate OPEN's potential as a foundation-like learned optimizer for RL that generalizes across architectures and algorithms, albeit with substantial compute requirements, and with curricula and scalability left as directions for future work.
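The three-stage update summarized above can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function name `open_style_update`, the default values of $\alpha_1, \alpha_2, \alpha_3$, and the list-based layout are hypothetical choices made for illustration. In OPEN the quantities $m_i$, $e_i$, and $\delta_i$ would come from the learned optimizer network; here they are passed in as plain lists. The sketch also assumes that $\epsilon$ is standard normal noise and that the "zero-mean correction" means subtracting the mean of the raw updates.

```python
import math
import random


def open_style_update(params, m, e, delta, is_actor,
                      alphas=(1e-3, 1e-3, 1e-3), rng=None):
    """Apply one OPEN-style update to a flat list of parameters.

    m, e, delta stand in for per-parameter optimizer-network outputs
    (hypothetical inputs here). alphas are illustrative values only.
    """
    a1, a2, a3 = alphas
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible

    # Stage 1: raw update u_hat_i = alpha_1 * m_i * exp(alpha_2 * e_i).
    u_hat = [a1 * mi * math.exp(a2 * ei) for mi, ei in zip(m, e)]

    # Stage 2 (actor only): add noise alpha_3 * delta_i * eps,
    # with eps assumed to be drawn from a standard normal.
    if is_actor:
        u_hat = [ui + a3 * di * rng.gauss(0.0, 1.0)
                 for ui, di in zip(u_hat, delta)]

    # Stage 3: zero-mean the updates (assumed: subtract their mean).
    mean_u = sum(u_hat) / len(u_hat)
    u = [ui - mean_u for ui in u_hat]

    # Parameter step p_i <- p_i - u_i.
    new_params = [p - ui for p, ui in zip(params, u)]
    return new_params, u


# Usage: an actor update with noise; the applied updates sum to ~0.
new_params, u = open_style_update(
    params=[0.0, 0.0, 0.0, 0.0],
    m=[1.0, 2.0, 3.0, 4.0],
    e=[0.5, 0.5, 0.5, 0.5],
    delta=[1.0, 1.0, 1.0, 1.0],
    is_actor=True,
)
print(sum(u))
```

Because the mean is subtracted after the noise is added, the applied updates sum to zero (up to floating-point error) whether or not the stochastic term is active.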
Abstract
While reinforcement learning (RL) holds great potential for decision making in the real world, it suffers from a number of unique difficulties which often need specific consideration. In particular: it is highly non-stationary; suffers from high degrees of plasticity loss; and requires exploration to prevent premature convergence to local optima and maximize return. In this paper, we consider whether learned optimization can help overcome these problems. Our method, Learned Optimization for Plasticity, Exploration and Non-stationarity (OPEN), meta-learns an update rule whose input features and output structure are informed by previously proposed solutions to these difficulties. We show that our parameterization is flexible enough to enable meta-learning in diverse learning contexts, including the ability to use stochasticity for exploration. Our experiments demonstrate that when meta-trained on single and small sets of environments, OPEN outperforms or equals traditionally used optimizers. Furthermore, OPEN shows strong generalization characteristics across a range of environments and agent architectures.
