Ancestral Reinforcement Learning: Unifying Zeroth-Order Optimization and Genetic Algorithms for Reinforcement Learning

So Nakashima; Tetsuya J. Kobayashi

TL;DR

The key idea in ARL is that each agent within a population infers the gradient by exploiting the history of its ancestors, i.e., the ancestor population in the past, while maintaining the diversity of policies in the current population as in GA.

Abstract

Reinforcement Learning (RL) offers a fundamental framework for discovering optimal action strategies through interactions with unknown environments. Recent advances have shown that the performance and applicability of RL can be significantly enhanced by exploiting a population of agents in various ways. Zeroth-Order Optimization (ZOO) leverages an agent population to estimate the gradient of the objective function, enabling robust policy refinement even in non-differentiable scenarios. As another application, Genetic Algorithms (GA) boost the exploration of policy landscapes by generating policy diversity in an agent population through mutation and refining it by selection. A natural question is whether a single agent population can offer the best of both worlds. In this work, we propose Ancestral Reinforcement Learning (ARL), which synergistically combines the robust gradient estimation of ZOO with the exploratory power of GA. The key idea in ARL is that each agent within a population infers the gradient by exploiting the history of its ancestors, i.e., the ancestor population in the past, while maintaining the diversity of policies in the current population as in GA. We also show theoretically that the populational search in ARL implicitly induces KL-regularization of the objective function, resulting in enhanced exploration. Our results extend the applicability of populational algorithms for RL.
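
The population-based gradient estimation attributed to ZOO above can be illustrated with a minimal sketch. It assumes an antithetic Gaussian-smoothing estimator, which is one common zeroth-order scheme and not necessarily the estimator used in the paper. The function names, the quadratic `objective` (a hypothetical stand-in for a cumulative reward), and all hyperparameters are assumptions made for illustration.

```python
import random

def zoo_gradient(f, theta, sigma=0.1, n_pairs=50, rng=None):
    """Estimate the gradient of f at theta from function values only.

    Each perturbation eps plays the role of one pair of agents whose
    parameters are theta + sigma*eps and theta - sigma*eps.
    """
    rng = rng or random.Random(0)
    dim = len(theta)
    grad = [0.0] * dim
    for _ in range(n_pairs):
        eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        plus = f([t + sigma * e for t, e in zip(theta, eps)])
        minus = f([t - sigma * e for t, e in zip(theta, eps)])
        # Finite difference along eps, averaged over all pairs.
        scale = (plus - minus) / (2.0 * sigma * n_pairs)
        for i in range(dim):
            grad[i] += scale * eps[i]
    return grad

def objective(theta):
    # Hypothetical objective (assumption): maximized at theta = (1, -2).
    return -((theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2)

theta = [0.0, 0.0]
rng = random.Random(0)
for _ in range(200):
    g = zoo_gradient(objective, theta, rng=rng)
    # Gradient ascent using only the zeroth-order estimate.
    theta = [t + 0.05 * gi for t, gi in zip(theta, g)]
```

No derivative of `objective` is ever evaluated, which is why this kind of estimator remains usable when the objective is not differentiable.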

Paper Structure (31 sections, 11 theorems, 76 equations, 3 figures, 6 algorithms)

Introduction
Related Works
Preliminary
Reinforcement Learning
Entropy-regularized Reinforcement Learning
Zeroth-Order Optimization
Population Optimization via GA (POGA)
Ancestral Reinforcement Learning: Unification of ZOO and POGA
Theoretical Basis of ARL
Step 1: AL is gradient ascent for population fitness
Step 2: Implicit KL regularization
Experimental Study
Tableau MDP
Cartpole
Conclusion
...and 16 more sections

Key Result

Lemma 1

Assume that Algorithm alg:ga does not mutate the policies. If $\pi$ satisfies $\lambda(\pi) > \lambda(\pi')$ for any other $\pi'$ and the population size is large enough, then $\pi$ dominates the population.
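
The intuition behind Lemma 1 can be checked numerically with a small sketch. It assumes that $\lambda(\pi)$ acts as a per-generation multiplicative fitness under fitness-proportional selection, and it tracks population shares deterministically, which corresponds to the limit of a very large population. Neither assumption is confirmed by this excerpt, and the fitness values and initial shares are hypothetical.

```python
def selection_step(shares, fitness):
    """One generation of fitness-proportional selection without mutation.

    shares[i] is the fraction of the population following policy i.
    """
    mean_fitness = sum(p * lam for p, lam in zip(shares, fitness))
    return [p * lam / mean_fitness for p, lam in zip(shares, fitness)]

# Hypothetical lambda values (assumption); the last is strictly largest.
fitness = [1.0, 1.1, 1.3]
# The fittest policy starts as a small minority.
shares = [0.49, 0.49, 0.02]

for _ in range(100):
    shares = selection_step(shares, fitness)
```

Because the share ratio between the fittest policy and any other policy is multiplied by a factor greater than one in every generation, the fittest policy ends up with almost the entire population even though it started rare.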

Figures (3)

Figure 1: Schematic representation of Ancestral Reinforcement Learning (ARL). In ARL, agents of the next generation are selected from the current population according to their fitness, defined by the cumulative rewards observed in MDP simulation. After selection, the policy of each agent is updated by ancestral learning. In the ancestral learning step, the policy is modified to imitate what the ancestor did, using the empirical policy of the parent (ancestor). Owing to survivorship bias, this update effectively works as a kind of gradient ascent.
Figure 2: Evaluation of ZOO (orange), POGA (green), and ARL (blue) on a tableau MDP problem whose optimal cumulative reward is around $9.57$. The horizontal axis is the number of episodes, and the vertical axis is the cumulative reward of the best policy in the population at each iteration. Each solid line shows the average of the maximum cumulative reward over five independent trials. The shaded zones around the curves show the standard deviation. For visualization, we take a moving average of the cumulative reward with window size $5$. Both ARL and ZOO achieve the optimal value, whereas POGA fails to find the optimal policy. We note that ZOO reaches the optimal value several times when the trajectories of individual trials are plotted.
Figure 3: Evaluation of ZOO (orange), POGA (green), and ARL (blue) on the CartPole problem in OpenAI Gymnasium, where the maximum cumulative reward is $500$. The format of the figure follows that of Figure 2 for ZOO and POGA. For ARL, the average and variance of the trajectories are computed from the four of five trials in which ARL succeeded in achieving a nearly optimal value. For visualization, we take a moving average with window size $10$.
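
The select-then-imitate loop described in the caption of Figure 1 can be sketched on a toy problem. This is a deliberately simplified illustration, not the paper's algorithm: it assumes a stateless three-armed bandit instead of an MDP, uses the observed reward directly as fitness, omits mutation, and uses only the parent's single most recent action as the "empirical policy". The function name `run_arl_bandit`, the reward values, and the mixing rate `alpha` are all hypothetical.

```python
import random

def run_arl_bandit(rewards, pop_size=200, generations=50, alpha=0.1, seed=0):
    rng = random.Random(seed)
    n_actions = len(rewards)
    # Every agent starts from the uniform policy.
    population = [[1.0 / n_actions] * n_actions for _ in range(pop_size)]
    for _ in range(generations):
        # Each agent acts once; its fitness is the reward it observed.
        actions = [rng.choices(range(n_actions), weights=pi)[0]
                   for pi in population]
        fitness = [rewards[a] for a in actions]
        # Selection: parents are resampled in proportion to fitness.
        parents = rng.choices(range(pop_size), weights=fitness, k=pop_size)
        # Ancestral learning (simplified): the child inherits the parent's
        # policy and shifts it toward the action the parent actually took.
        population = [
            [(1.0 - alpha) * p + alpha * (1.0 if a == actions[j] else 0.0)
             for a, p in enumerate(population[j])]
            for j in parents
        ]
    return population

# Hypothetical rewards (assumption); action 2 is the best arm.
population = run_arl_bandit([0.1, 0.2, 1.0])
mean_best = sum(pi[2] for pi in population) / len(population)
```

Surviving parents are disproportionately those that happened to take rewarding actions, so imitating them raises the probability of those actions on average. This is the survivorship-bias effect that the caption says works as a kind of gradient ascent.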

Theorems & Definitions (18)

Lemma 1
Theorem 2
Proposition 3
Theorem 4
proof
Lemma 5
proof
Lemma 6
proof
Lemma 7
...and 8 more
