Mining Generalizable Activation Functions

Alex Vitvitskyi; Michael Boratko; Matej Grcic; Razvan Pascanu; Deep Shah; Petar Veličković

TL;DR

The paper tackles the challenge of selecting activation functions by introducing a flexible, open-search paradigm that mines generalizable activations. It uses AlphaEvolve, an evolutionary framework guided by frontier language models, to explore an unbounded space of Python activation functions under a FLOP budget, with fitness driven by out-of-distribution generalization on synthetic tasks. Empirically, activations like GELUSine and GELU-Sinc-Perturbation emerge as robust across downstream benchmarks, with GELU-Sinc-Perturbation often delivering the best overall transfer. The work demonstrates that simple, periodic augmentations to proven activations can enhance OOD generalization, and that a small-scale lab protocol can yield activations that generalize to larger, more complex tasks.

Abstract

The choice of activation function is an active area of research, with different proposals aimed at improving optimization while maintaining expressivity. Additionally, the activation function can significantly alter the implicit inductive bias of the architecture, controlling its non-linear behavior. In this paper, in line with previous work, we argue that evolutionary search provides a useful framework for finding new activation functions, and we make two novel observations. The first is that modern pipelines such as AlphaEvolve, which rely on frontier LLMs as a mutation operator, allow for a much wider and more flexible search space, e.g., over all possible Python functions within a certain FLOP budget, eliminating the need for manually constructed search spaces. In addition, these pipelines will be biased towards meaningful activation functions, given their ability to represent common knowledge, leading to a potentially more efficient search of the space. The second observation is that, through this framework, one can target not only performance improvements but also activation functions that encode particular inductive biases. This can be done by using performance on out-of-distribution data as a fitness function, reflecting the degree to which the architecture respects the inherent structure in the data in a manner independent of distribution shifts. We carry out an empirical exploration of this proposal and show that relatively small-scale synthetic datasets can be sufficient for AlphaEvolve to discover meaningful activations.
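The idea of scoring an activation by out-of-distribution error can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not taken from the paper: the candidate `gelu_plus_sine` is a hypothetical GELU-with-periodic-term function (not the paper's GELUSine definition), the model is a random-feature regressor with a trained linear readout, and the intervals, sizes, and learning rate are arbitrary choices.

```python
import math
import random


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_plus_sine(x, amplitude=0.1):
    # Hypothetical candidate: GELU plus a small periodic term.
    # This is an illustrative stand-in, not the paper's GELUSine.
    return gelu(x) + amplitude * math.sin(x)


def ood_fitness(activation, target, hidden=8, steps=2000, lr=0.01, seed=0):
    """Fit on an in-distribution interval, score on a disjoint one.

    Returns the negative OOD mean squared error, so higher is better.
    """
    rng = random.Random(seed)
    # Fixed random hidden layer (assumption: only the readout is trained).
    w = [rng.uniform(-1.0, 1.0) for _ in range(hidden)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(hidden)]

    def features(x):
        return [activation(wj * x + bj) for wj, bj in zip(w, b)] + [1.0]

    train_x = [-1.0 + 2.0 * i / 31 for i in range(32)]   # ID: [-1, 1]
    test_x = [1.0 + (i + 1) / 16 for i in range(16)]     # OOD: (1, 2]
    train_f = [features(x) for x in train_x]
    train_y = [target(x) for x in train_x]

    # Full-batch gradient descent on the mean squared error of the readout.
    readout = [0.0] * (hidden + 1)
    for _ in range(steps):
        grad = [0.0] * (hidden + 1)
        for f, y in zip(train_f, train_y):
            err = sum(r * fi for r, fi in zip(readout, f)) - y
            for k, fi in enumerate(f):
                grad[k] += 2.0 * err * fi / len(train_x)
        readout = [r - lr * g for r, g in zip(readout, grad)]

    def predict(x):
        return sum(r * fi for r, fi in zip(readout, features(x)))

    ood_mse = sum((predict(x) - target(x)) ** 2 for x in test_x) / len(test_x)
    return -ood_mse


score = ood_fitness(gelu_plus_sine, math.sin)
baseline = ood_fitness(gelu, math.sin)
```

In an evolutionary loop, a score of this kind would rank candidate functions; comparing `score` with `baseline` shows how two candidates are compared under the same seed and data split.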

Paper Structure (29 sections, 2 equations, 9 figures, 4 tables)

Introduction
Contributions
AlphaEvolve for activation functions
Search
Datasets
Evaluation
Activation function analysis
Downstream evaluation
Discussion
Conclusions
Activation Functions Code
AlphaEvolve meta-prompt
GELUSine ablations
Datasets
Feynman Equations
...and 14 more sections

Figures (9)

Figure 1: Overall description of the evolutionary search framework we use to find activation functions. We rely on an evolutionary search powered by AlphaEvolve (Novikov et al., 2025), designed to optimize test performance for small-scale models trained on carefully constructed, synthetic datasets. We demonstrate that functions discovered in this way are capable of meaningful forms of generalization without sacrificing their general-purpose potency.
Figure 2: Visualization of in-distribution (training) and out-of-distribution (test) data for the one-dimensional target functions leveraged in our small-scale lab environment. Each function type tests a different kind of generalization in a way that supports rapid model training. Note that the ID and OOD panels show different functions; there are no discontinuities in the functions we study.
Figure 3: Illustration of a typical AlphaEvolve evolution when asked to discover new activation functions. Early on, it rediscovers functions present in the literature (Swish/SiLU in this case). It then discovers interesting ways to recombine standard building blocks (polynomials, leaky ReLU, and square roots), at which point it reaches the best tradeoff between score and transferability. Soon after, AlphaEvolve realises that its function does not need to be pointwise, and leverages the batch axis of the input tensor to extract and exploit basic batch statistics. This quickly spirals into constructing highly elaborate functions that achieve excellent score, but heavily overfit to the specifics of the "lab dataset" by utilising multiple moments of the distribution.
Figure 4: Visualisation of four pointwise activation functions newly discovered by our system: Turbulent Activation Function, Gaussian-Modulated Tangent Unit (GMTU), GELUSine and GELU-Sinc-Perturbation.
Figure 5: Histogram of the pre-activation entries for models trained on the synthetic dataset with GELUSine, GELU-Sinc-Perturbation, and ReLU. Note that the range tends to be wide enough, i.e. $[-2,2]$, for the model to exploit the structure of the activation functions.
...and 4 more figures
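The Figure 3 caption distinguishes pointwise activations from ones that exploit batch statistics. The sketch below illustrates that difference only. The source does not give the form of the batch-dependent functions that were discovered, so `batch_stat_activation` is a hypothetical stand-in that standardises inputs by the batch mean and standard deviation before applying GELU.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def pointwise_activation(batch):
    # Each output depends only on its own input element.
    return [gelu(x) for x in batch]


def batch_stat_activation(batch):
    # Hypothetical non-pointwise activation: each output also depends on
    # the mean and standard deviation of the whole batch.
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    std = math.sqrt(var + 1e-6)
    return [gelu((x - mean) / std) for x in batch]


# The first element is identical in both batches; only its neighbours differ.
batch_a = [0.5, -1.0, 2.0]
batch_b = [0.5, 10.0, 20.0]

same_pointwise = pointwise_activation(batch_a)[0] == pointwise_activation(batch_b)[0]
same_batch_stat = batch_stat_activation(batch_a)[0] == batch_stat_activation(batch_b)[0]
```

Here `same_pointwise` is true because the pointwise output ignores the rest of the batch, while `same_batch_stat` is false because the same input is mapped differently depending on its neighbours. That dependence on the data distribution is the kind of coupling the caption associates with overfitting to the lab dataset.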
