Evolving Reservoirs for Meta Reinforcement Learning

Corentin Léger; Gautier Hamon; Eleni Nisioti; Xavier Hinaut; Clément Moulin-Frier

TL;DR

This work investigates how neural structures shaped at an evolutionary timescale can bias and accelerate learning during an agent's lifetime. It introduces ER-MRL, a framework that evolves reservoir hyperparameters with CMA-ES and uses the resulting reservoirs to encode the environment state before it reaches an action policy trained with RL. The results show improved learning under partial observability, evidence of oscillatory reservoir dynamics that may resemble central pattern generators (CPGs) for locomotion, and generalization to some unseen tasks and morphologies, with task-dependent limits. The approach highlights a biologically inspired, compact, indirect encoding of neural structure that can reduce developmental learning costs and enable more versatile adaptation across sensorimotor tasks.

Abstract

Animals often demonstrate a remarkable ability to adapt to their environments during their lifetime. They do so partly due to the evolution of morphological and neural structures. These structures capture features of environments shared between generations to bias and speed up lifetime learning. In this work, we propose a computational model for studying a mechanism that can enable such a process. We adopt a computational framework based on meta reinforcement learning as a model of the interplay between evolution and development. At the evolutionary scale, we evolve reservoirs, a family of recurrent neural networks that differ from conventional networks in that one optimizes not the synaptic weights, but hyperparameters controlling macro-level properties of the resulting network architecture. At the developmental scale, we employ these evolved reservoirs to facilitate the learning of a behavioral policy through Reinforcement Learning (RL). Within an RL agent, a reservoir encodes the environment state before providing it to an action policy. We evaluate our approach on several 2D and 3D simulated environments. Our results show that the evolution of reservoirs can improve the learning of diverse challenging tasks. We study in particular three hypotheses: the use of an architecture combining reservoirs and reinforcement learning could enable (1) solving tasks with partial observability, (2) generating oscillatory dynamics that facilitate the learning of locomotion tasks, and (3) facilitating the generalization of learned behaviors to new tasks unknown during the evolution phase.
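To make the idea of a reservoir concrete, here is a minimal sketch of a leaky echo state network in NumPy. It is not the authors' implementation: the hyperparameter names (`spectral_radius`, `input_scaling`, `leak_rate`), the network size, and the helper functions `make_reservoir` and `reservoir_step` are assumptions chosen for illustration, based on common reservoir computing practice. The sketch shows that the weights are drawn at random and never trained. Only a few macro-level hyperparameters shape the network, and the reservoir state is what an action policy would receive in place of the raw observation.

```python
import numpy as np

def make_reservoir(n_units, n_inputs, spectral_radius, input_scaling, seed):
    """Generate fixed random reservoir weights from macro-level hyperparameters."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_units, n_units))
    # Rescale recurrent weights so the largest absolute eigenvalue
    # equals the requested spectral radius.
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
    W_in = input_scaling * rng.uniform(-1.0, 1.0, size=(n_units, n_inputs))
    return W, W_in

def reservoir_step(state, obs, W, W_in, leak_rate):
    """Leaky-integrator update: blend the previous state with the new activation."""
    pre_activation = W @ state + W_in @ obs
    return (1.0 - leak_rate) * state + leak_rate * np.tanh(pre_activation)

# Hypothetical values, for illustration only.
W, W_in = make_reservoir(n_units=50, n_inputs=3,
                         spectral_radius=0.9, input_scaling=0.5, seed=0)

state = np.zeros(50)
observations = np.random.default_rng(1).normal(size=(20, 3))
for obs in observations:
    # The reservoir accumulates a memory of past observations in its state.
    state = reservoir_step(state, obs, W, W_in, leak_rate=0.3)

# `state` is the feature vector a policy would be trained on.
```

Because the state depends on the history of observations, a feedforward policy reading it can in principle recover information that a single observation hides, which is the intuition behind hypothesis (1).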

Paper Structure (37 sections, 1 equation, 13 figures)

Introduction
Background
Reinforcement Learning as a model of development
Meta Reinforcement Learning as a model of the interplay between evolution and development
Reservoir computing as a model of neural structure generation
Evolutionary algorithms as a model of evolution
Evolving Reservoirs for Meta Reinforcement Learning (ER-MRL)
General approach
Inner loop
Outer loop
Evaluation
Results
Evolved reservoirs improve learning in highly partially observable environments
Evolved reservoirs could generate oscillatory dynamics that facilitate the learning of locomotion tasks
Evolved reservoirs improve generalization on new tasks unseen during evolution phase
...and 22 more sections

Figures (13)

Figure 1: (left) A simplified view of the evolution of brain structures. The generating parameters of neural structures are modified in an evolutionary loop. In the developmental loop, agents equipped with these neural structures learn to interact with their environment. (right) The parallel with our computational approach. We propose a computational framework where an evolutionary algorithm optimizes hyperparameters that generate neural structures called reservoirs. These reservoirs are then integrated into RL agents that learn an action policy to maximize their reward in an environment.
Figure 2: Our proposed architecture, called ER-MRL, integrates several ML paradigms. We consider an RL agent learning an action policy (a), having access to a reservoir (c). We consider two nested adaptive loops in the spirit of Meta-RL (b). Our proposed architecture (d) consists of evolving hyperparameters (HPs) $\phi$ for the generation of reservoirs in an outer loop. In an inner loop, the agent learns an action policy that takes as input the neural activation of the reservoir. The policy is trained using RL in order to maximize episodic return. Section \ref{methods} provides the computational details of each ML paradigm.
Figure 3: In the evolution phase (top), CMA-ES refines reservoir HPs $\Phi$. At each generation $i$ of the evolution loop (left), a population $\Phi_i = \{\Phi_i^1, \ldots, \Phi_i^n\}$ of HPs is sampled from the CMA-ES Gaussian distribution. Each $\Phi_i^j$ is evaluated on multiple random seeds, generating multiple reservoirs. An ER-MRL agent is created for each reservoir, with its action policy being trained from the states of that reservoir (lighter grey frames). The fitness of a sampled $\Phi_i^j$ is determined by the average score of all ER-MRL agents generated from it (mid-grey frames). The fitness values are used to update the CMA-ES distribution for the next generation (dotted arrow). This process iterates until a predetermined threshold is reached. In the testing phase (bottom), the best set of HPs $\Phi^{*}$ from all CMA-ES samples is employed. Multiple reservoirs are generated within ER-MRL agents, and their performance is evaluated.
Figure 4: Learning curves for partially observable tasks. The x-axis represents the number of timesteps during training and the y-axis the mean episodic reward. Learning curves of our ER-MRL methods correspond to the testing phase described at the bottom of Fig. \ref{details_fig}. Vanilla RL corresponds to a feedforward policy RL agent. The curves and the shaded areas represent the mean and the standard deviation of the reward for 10 random seeds. See Section \ref{part_obs_benchmark} for a comparison with another method.
Figure 5: Learning curves for locomotion tasks. Same conventions as Fig. \ref{po_fig}.
...and 8 more figures
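The evolution loop described in the Figure 3 caption can be sketched as follows. This is a hedged illustration, not the paper's code. It replaces CMA-ES with a much simpler Gaussian evolution strategy (no covariance adaptation, fixed step-size decay) so that the example depends only on NumPy. It also replaces the inner RL training loop with a hypothetical stand-in, `inner_loop_score`, which returns a noisy score peaking at an arbitrary target. All names, population sizes, and numeric values are assumptions.

```python
import numpy as np

def inner_loop_score(hps, seed):
    """Hypothetical stand-in for training an RL policy on one reservoir.

    In the real method this would build a reservoir from `hps` with the
    given seed, train a policy on its states, and return the final score.
    Here it is a noisy quadratic that peaks at an arbitrary target.
    """
    rng = np.random.default_rng(seed)
    target = np.array([0.9, 0.3, 0.5])
    return -np.sum((hps - target) ** 2) + 0.01 * rng.normal()

def fitness(hps, n_seeds=3):
    """Average score over several seeds, i.e. several reservoirs per HP set."""
    return np.mean([inner_loop_score(hps, s) for s in range(n_seeds)])

def evolve(n_generations=40, pop_size=16, n_elite=4, seed=0):
    """Simplified Gaussian evolution strategy (not CMA-ES)."""
    rng = np.random.default_rng(seed)
    mean, sigma = np.zeros(3), 0.5
    best_hps, best_fit = mean.copy(), -np.inf
    for _ in range(n_generations):
        # Sample a population of HP vectors from the current Gaussian.
        population = mean + sigma * rng.normal(size=(pop_size, 3))
        scores = np.array([fitness(p) for p in population])
        # Move the distribution toward the best-scoring samples.
        elite = population[np.argsort(scores)[-n_elite:]]
        mean = elite.mean(axis=0)
        sigma *= 0.95
        # Track the best HP set seen across all generations.
        i = np.argmax(scores)
        if scores[i] > best_fit:
            best_hps, best_fit = population[i].copy(), scores[i]
    return best_hps, best_fit

best_hps, best_fit = evolve()
```

Averaging over seeds in `fitness` mirrors the caption's point that an HP set is scored by the mean performance of several reservoirs generated from it, so the search favors hyperparameters that work across random weight draws rather than one lucky network.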
