Efficient Search for Customized Activation Functions with Gradient Descent

Lukas Strack, Mahmoud Safari, Frank Hutter

TL;DR

This work addresses the challenge of selecting or designing activation functions tailored to a given model and dataset. It introduces GRAFS, a gradient-based activation function search that treats activations as a differentiable space defined by a graph-like activation cell of unary and binary operations, optimized via bi-level gradient methods with a Dirichlet sampling scheme $\mathrm{Dir}(\rho)$. Key contributions include warmstarting, bounded-output regularization, progressive shrinking, and variance-reduced sampling, enabling search overhead orders of magnitude lower than prior approaches. Empirically, the method discovers specialized activations that improve performance for ResNet, ViT, and GPT families and transfer well to larger models and new datasets, demonstrating practical potential for automatic activation design in real pipelines.

Abstract

Different activation functions work best for different deep learning models. To exploit this, we leverage recent advancements in gradient-based search techniques for neural architectures to efficiently identify high-performing activation functions for a given application. We propose a fine-grained search cell that combines basic mathematical operations to model activation functions, allowing for the exploration of novel activations. Our approach enables the identification of specialized activations, leading to improved performance in every model we tried, from image classification to language models. Moreover, the identified activations exhibit strong transferability to larger models of the same type, as well as new datasets. Importantly, our automated process for creating customized activation functions is orders of magnitude more efficient than previous approaches. It can easily be applied on top of arbitrary deep learning pipelines and thus offers a promising practical avenue for enhancing deep learning architectures.
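To make the idea of a search cell built from basic operations more concrete, here is a minimal sketch, not the authors' implementation. It assumes a cell with two unary branches feeding one binary node, and it assumes small candidate sets: only the sigmoid and the left/right projections $L, R$ are named in this summary, so the other operations are hypothetical placeholders. Mixing weights are drawn from a Dirichlet distribution by normalizing gamma variates. The concentration values are illustrative, and no gradient-based optimization is performed here.

```python
import math
import random

# Candidate operations. ASSUMPTION: only sigmoid and the L/R projections are
# named in the source; the remaining entries are illustrative placeholders.
UNARY_OPS = {
    "identity": lambda x: x,
    "sigmoid": lambda x: 0.5 * (1.0 + math.tanh(0.5 * x)),  # stable sigmoid
    "tanh": math.tanh,
    "relu": lambda x: max(0.0, x),
}
BINARY_OPS = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "L": lambda a, b: a,  # left projection
    "R": lambda a, b: b,  # right projection
}


def sample_dirichlet(concentration, rng):
    """Draw one Dirichlet sample by normalizing independent gamma variates."""
    draws = [rng.gammavariate(c, 1.0) for c in concentration]
    total = sum(draws)
    return [d / total for d in draws]


def mixed_unary(x, weights):
    """Weighted sum of all candidate unary operations applied to x."""
    return sum(w * op(x) for w, op in zip(weights, UNARY_OPS.values()))


def mixed_binary(a, b, weights):
    """Weighted sum of all candidate binary operations applied to (a, b)."""
    return sum(w * op(a, b) for w, op in zip(weights, BINARY_OPS.values()))


def activation_cell(x, w_left, w_right, w_binary):
    """ASSUMED topology: two unary branches on x, merged by one binary node."""
    left = mixed_unary(x, w_left)
    right = mixed_unary(x, w_right)
    return mixed_binary(left, right, w_binary)


def one_hot(index, size):
    """Weights that select a single operation, as after discretization."""
    return [1.0 if i == index else 0.0 for i in range(size)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Relaxed cell: every edge mixes all candidates with sampled weights.
    w_l = sample_dirichlet([1.0] * len(UNARY_OPS), rng)
    w_r = sample_dirichlet([1.0] * len(UNARY_OPS), rng)
    w_b = sample_dirichlet([1.0] * len(BINARY_OPS), rng)
    print(activation_cell(0.5, w_l, w_r, w_b))
    # Discrete cell: identity and sigmoid merged by mul gives x * sigmoid(x).
    print(activation_cell(2.0, one_hot(0, 4), one_hot(1, 4), one_hot(1, 4)))
```

With one-hot weights the mixture collapses to a single closed-form activation, which is how a discrete function could be read off a relaxed cell once the search has converged, under the assumptions above.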

Paper Structure (25 sections, 4 equations, 5 figures, 15 tables, 2 algorithms)

Figures (5)

  • Figure 1: (Left) Set of unary and binary operations. $\gamma$ is a learnable parameter that is trained along with the activation parameters and is frozen after the search is completed. $\sigma(x)$ is the sigmoid function, and $L, R$ are the left and right projection operations. (Right) Activation cell: a combination of unary and binary operations.
  • Figure 2: (Bottom) Log-scaled distribution of epochs at which operations are dropped. (Top) The histogram determines the number of operations to drop per epoch.
  • Figure 3: Plots of activation functions in Eq. \ref{eq:rn}, found on ResNet20 / CIFAR10.
  • Figure 4: Plots of activation functions in Eq. \ref{eq:vit}, found on ViT-Tiny / CIFAR10.
  • Figure 5: Plots of activation functions in Eq. \ref{eq:gpt}, found on miniGPT / TinyStories.