Gompertz Linear Units: Leveraging Asymmetry for Enhanced Learning Dynamics

Indrashis Das, Mahmoud Safari, Steven Adriaensen, Frank Hutter

TL;DR

The paper proposes GoLU, a self-gated activation using the asymmetric Gompertz gate to improve learning dynamics. Defined as $\mathrm{GoLU}(x) = x \, e^{-e^{-x}}$, the gate reduces activation variance and yields a smoother loss landscape while maintaining robust gradient flow. Through extensive experiments across ImageNet, CIFAR, language modeling, semantic and instance segmentation, object detection, diffusion models, and machine translation, GoLU often outperforms ReLU, GELU, Swish, Mish, and other baselines, with a dedicated CUDA kernel ensuring practical training and inference speed. The work presents GoLU as a robust, scalable alternative for modern neural networks with broad potential impact across domains.

Abstract

Activation functions are fundamental elements of deep learning architectures as they significantly influence training dynamics. ReLU, while widely used, is prone to the dying neuron problem, which has been mitigated by variants such as LeakyReLU, PReLU, and ELU that better handle negative neuron outputs. Recently, self-gated activations like GELU and Swish have emerged as state-of-the-art alternatives, leveraging their smoothness to ensure stable gradient flow and prevent neuron inactivity. In this work, we introduce the Gompertz Linear Unit (GoLU), a novel self-gated activation function defined as $\mathrm{GoLU}(x) = x \, \mathrm{Gompertz}(x)$, where $\mathrm{Gompertz}(x) = e^{-e^{-x}}$. The GoLU activation leverages the right-skewed asymmetry in the Gompertz function to reduce variance in the latent space more effectively compared to GELU and Swish, while preserving robust gradient flow. Extensive experiments across diverse tasks, including Image Classification, Language Modeling, Semantic Segmentation, Object Detection, Instance Segmentation, and Diffusion, highlight GoLU's superior performance relative to state-of-the-art activation functions, establishing GoLU as a robust alternative to existing activation functions.
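
As an illustration of the definition above, the following is a minimal scalar sketch in plain Python. It is not the authors' CUDA kernel; the function names `gompertz`, `golu`, and `golu_grad` and the cutoff constant `_NEG_CUTOFF` are hypothetical and chosen only for this example. The gradient follows from the product rule applied to $x \, e^{-e^{-x}}$, and the cutoff for very negative inputs is an assumption made here to avoid floating-point overflow in `exp(-x)`.

```python
import math

# Hypothetical cutoff (an assumption of this sketch): below it, exp(-x) is
# close to overflowing a float, and the gate is 0 to machine precision anyway.
_NEG_CUTOFF = -700.0


def gompertz(x: float) -> float:
    """Gompertz gate e^{-e^{-x}}."""
    if x < _NEG_CUTOFF:
        return 0.0
    return math.exp(-math.exp(-x))


def golu(x: float) -> float:
    """GoLU(x) = x * Gompertz(x)."""
    return x * gompertz(x)


def golu_grad(x: float) -> float:
    """Derivative of GoLU via the product rule:
    d/dx [x * g(x)] = g(x) + x * g'(x), with g'(x) = e^{-x} * g(x).
    """
    if x < _NEG_CUTOFF:
        return 0.0
    return gompertz(x) * (1.0 + x * math.exp(-x))


if __name__ == "__main__":
    for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(f"x={x:+.1f}  gate={gompertz(x):.6f}  "
              f"golu={golu(x):+.6f}  grad={golu_grad(x):+.6f}")
```

For large positive inputs the gate approaches 1, so GoLU approaches the identity; for large negative inputs the gate approaches 0, so the output vanishes. A framework implementation would apply the same formula elementwise to tensors.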

Paper Structure

This paper contains 36 sections, 14 equations, 26 figures, 10 tables.

Figures (26)

  • Figure 1: Activation functions (Left) and their corresponding gate functions (Right). GoLU and its gate, the Gompertz function, are highlighted in red. Note the slight rightward shift of the Gompertz gate.
  • Figure 2: Comparison of the distributions underlying the gate functions (Left) and the gradients (Right) of various gated activations. The Gumbel distribution exhibits a slight rightward skew.
  • Figure 3: Image created by Dall-E 3 (Left) and kernel density estimation curves for distributions of activation outputs for the image (Right). GoLU reduces variance most compared to baseline activations.
  • Figure 4: Distributions of final activation outputs of ResNet-50 trained on ImageNet-1k for three randomly sampled images from ImageNet-1k. GoLU leads to a more peaked distribution for the final activation output.
  • Figure 5: The loss landscape on the test set of ResNet-20 trained on CIFAR-10 with ReLU, GELU, Swish and GoLU after adding random, scaled perturbations to the learned weights (refer to the loss-landscape appendix for more details).
  • ...and 21 more figures
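
The captions of Figures 1 and 2 relate the Gompertz gate to the Gumbel distribution. As a short worked sketch (the symbols $F_{\mathrm{Gumbel}}$ and $f_{\mathrm{Gumbel}}$ are notation introduced here, not taken from the paper), the gate coincides with the cumulative distribution function of the standard Gumbel distribution, and differentiating it gives the corresponding density:

```latex
\begin{align*}
\mathrm{Gompertz}(x) &= e^{-e^{-x}} = F_{\mathrm{Gumbel}}(x), \\
\frac{d}{dx}\,\mathrm{Gompertz}(x) &= e^{-x}\, e^{-e^{-x}} = f_{\mathrm{Gumbel}}(x).
\end{align*}
```

The standard Gumbel density has its mode at $x = 0$ and its mean at the Euler–Mascheroni constant $\gamma \approx 0.5772$, which is consistent with the slight rightward skew noted in the captions.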