Evolution Transformer: In-Context Evolutionary Optimization

Robert Tjarko Lange; Yingtao Tian; Yujin Tang

TL;DR

This work introduces Evolution Transformer, a causal Transformer architecture that can flexibly characterize a family of Evolution Strategies. It also proposes a technique to train the Evolution Transformer fully self-referentially, starting from a random initialization and bootstrapping its own learning progress.

Abstract

Evolutionary optimization algorithms are often derived from loose biological analogies and struggle to leverage information obtained during the sequential course of optimization. A promising alternative is to leverage data and directly discover powerful optimization principles via meta-optimization. In this work, we follow this paradigm and introduce Evolution Transformer, a causal Transformer architecture that can flexibly characterize a family of Evolution Strategies. Given a trajectory of evaluations and search distribution statistics, Evolution Transformer outputs a performance-improving update to the search distribution. The architecture imposes a set of suitable inductive biases, i.e., the invariance of the distribution update to the order of population members within a generation and equivariance to the order of the search dimensions. We train the model weights using Evolutionary Algorithm Distillation, a technique for supervised optimization of sequence models using teacher algorithm trajectories. The resulting model exhibits strong in-context optimization performance and generalizes well to otherwise challenging neuroevolution tasks. We analyze the resulting properties of the Evolution Transformer and propose a technique to train it fully self-referentially, starting from a random initialization and bootstrapping its own learning progress. We provide an open-source implementation at https://github.com/RobertTLange/evosax.
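The two inductive biases named in the abstract can be checked on a much simpler update rule than the Evolution Transformer itself. The sketch below is a toy illustration, not the paper's architecture: `es_mean_update` is a hypothetical elite-averaging mean update, and `sphere`, the population size, and the permutations are assumptions chosen for the example. It shows what "invariant to population order" and "equivariant to dimension order" mean operationally.

```python
import random


def es_mean_update(solutions, fitness, elite_frac=0.5):
    """Toy ES mean update (hypothetical): average the best solutions.

    solutions: list of population members, each a list of floats.
    fitness:   list of floats, one per member; lower is better.
    """
    num_elite = max(1, int(len(solutions) * elite_frac))
    # Rank members by fitness. Without ties, the ranking depends only on the
    # (solution, fitness) pairs, not on the order in which they are passed.
    ranked = sorted(range(len(solutions)), key=lambda i: fitness[i])
    elite = [solutions[i] for i in ranked[:num_elite]]
    num_dims = len(solutions[0])
    # Each dimension is averaged independently of the others.
    return [sum(m[d] for m in elite) / num_elite for d in range(num_dims)]


def sphere(x):
    """Assumed test objective: sum of squares."""
    return sum(v * v for v in x)


rng = random.Random(0)
pop = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(5)]
fit = [sphere(x) for x in pop]
base = es_mean_update(pop, fit)

# Invariance: reordering population members leaves the update unchanged.
member_perm = [4, 2, 0, 3, 1]
pop_shuffled = [pop[i] for i in member_perm]
fit_shuffled = [fit[i] for i in member_perm]
invariant = es_mean_update(pop_shuffled, fit_shuffled) == base

# Equivariance: reordering search dimensions reorders the update identically.
dim_perm = [2, 0, 1]
pop_dims = [[x[d] for d in dim_perm] for x in pop]
equivariant = es_mean_update(pop_dims, fit) == [base[d] for d in dim_perm]

print(invariant, equivariant)
```

In the Evolution Transformer these properties come from the architecture (attention over population members and per-dimension embeddings) rather than from sorting and averaging; the check itself carries over to any update rule.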

Paper Structure (16 sections, 6 equations, 10 figures, 4 tables)

Introduction
Related Work & Background
Evolution Transformer: Population-Order Invariant & Dimension-Order Equivariant Search Updates
Supervised Evolutionary Algorithm Distillation clones various teacher BBO algorithms
Analysis: Evolution Transformer Captures Desirable Evolution Strategy Properties
Meta-Evolution of Evolution Transformer Weights Can Overfit The Meta-Training Task Distribution
Self-Referential Evolutionary Algorithm Distillation is feasible but can be unstable
Conclusion
Additional Results
Brax Learning Curves
Impact of Task Distribution for EAD
Evolution Transformer Features & Model Hyperparameters
Meta-Evolution Hyperparameters
Self-Referential Evolutionary Algorithm Distillation (SR-EAD) Hyperparameters
Software Dependencies
...and 1 more sections

Figures (10)

Figure 1: Evolution Transformer. We construct features resembling information from solution evaluations and the search distribution. They are processed by self-attention and Perceiver modules to obtain four separate embeddings. The stacked per-dimension embeddings are processed by standard Transformer encoder blocks. An MLP outputs the distribution update predictions. The model is invariant to the order of the population members and equivariant to the search dimension order.
Figure 2: Evolutionary Algorithm Distillation allows EvoTF to distill teacher algorithms. Top. KL distillation loss with different Transformer modules. Middle. Evaluation on a 14x14 MNIST CNN Classification task throughout distillation. Bottom. 'S+F' uses only the Solution and Fitness Perceiver. 'S+F+D' also uses the Distribution Attention and 'S+F+D+CD' uses all network modules. Evaluation on a Pendulum MLP Control task throughout distillation. Results are averaged across 3 independent runs.
Figure 3: Self-Attention and Perceiver maps for Evolution Transformer (EAD-trained on SNES) with a single attention block at a single generation. Top. Separable Sphere problem. Bottom. Non-separable Rosenbrock problem. All problems are 3-dimensional and use 5 population members. The fitness attention assigns higher credit to the best-performing population members. The distribution attention indicates that the EvoTF correctly infers whether the fitness landscape is separable.
Figure 4: Evolution Transformer (SNES) properties on characteristic problems. EvoTF correctly implements the core properties of unbiasedness on the random fitness (top row), translation invariance on a Sphere task (middle row), and scale self-adaptation on the linear fitness (bottom row). All tasks consider 3 search dimensions and 5 population members. The search mean is initialized in $[-3, 3]$.
Figure 5: Meta-Evolution of Evolution Transformer weights. We train the neural network weights using meta-black-box optimization (lange2023discovering_es, lange2023discovering_ga) on a set of 5 BBOB problems. We compare evolving an EvoTF parametrization from scratch with fine-tuning an EAD-pretrained SNES-EvoTF initialization. While meta-evolution quickly improves performance on BBOB tasks, it tends not to generalize to neuroevolution tasks. Results are averaged across 3 independent runs.
...and 5 more figures
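
Figure 2 reports a KL distillation loss between teacher and student. As a rough illustration of what such a loss can look like, the sketch below computes the KL divergence between two diagonal Gaussian search distributions. This is an assumption made for the example: the excerpt does not state the exact loss, and `diag_gaussian_kl` and its arguments are hypothetical names. The formula is the standard closed form for diagonal Gaussians.

```python
import math


def diag_gaussian_kl(mean_p, std_p, mean_q, std_q):
    """KL(p || q) for diagonal Gaussians p = N(mean_p, std_p^2), q = N(mean_q, std_q^2).

    All arguments are equal-length lists of floats; standard deviations must
    be strictly positive. In a distillation setting (assumed here), p would be
    the teacher's updated search distribution and q the student's prediction.
    """
    kl = 0.0
    for mp, sp, mq, sq in zip(mean_p, std_p, mean_q, std_q):
        # Per-dimension closed form; dimensions are independent, so terms add.
        kl += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2) - 0.5
    return kl


# Identical distributions give zero loss.
same = diag_gaussian_kl([0.0, 1.0], [1.0, 0.5], [0.0, 1.0], [1.0, 0.5])

# A unit mean shift in one dimension with unit variance contributes 0.5.
shifted = diag_gaussian_kl([0.0], [1.0], [1.0], [1.0])

print(same, shifted)
```

Because the loss is a sum over search dimensions, it can be evaluated for any problem dimensionality without changing the code.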
