Deriving Activation Functions Using Integration

Allen Hao Huang; Imanol Schlag

TL;DR

This work develops a gradient-centric paradigm for activation design: activation functions are derived by integrating trainable transformations of a chosen gradient. The resulting xIELU combines a linearly increasing gradient for positive inputs (like ReLU$^2$) with a trainable gradient for negative inputs (inspired by xSiLU), which lets the activation adapt its nonlinearity across network depth; a companion variant, xIPReLU, offers a computationally lighter alternative. On decoder-only Llama models (1.1B and 3B parameters) trained on 125B tokens, xIELU and xIPReLU achieve lower perplexity than ReLU$^2$ and SwiGLU at matched compute, with xIELU delivering the strongest gains. These results suggest that gradient-focused activation design can improve training efficiency and performance in large-scale language modeling.

Abstract

Our work proposes a novel approach to designing activation functions by focusing on their gradients and deriving the corresponding activation functions using integration. We introduce the Expanded Integral of the Exponential Linear Unit (xIELU), a trainable piecewise activation function derived by integrating trainable affine transformations applied to the Exponential Linear Unit (ELU). xIELU combines two key properties for the gradient: (1) a trainable and linearly increasing gradient for positive inputs, similar to Squared ReLU (ReLU$^2$), and (2) a trainable gradient that can take negative values for negative inputs, inspired by Expanded SiLU (xSiLU). Conceptually, xIELU can be viewed as an extension of ReLU$^2$ to handle negative inputs. The trainable parameters in xIELU allow it to adaptively reduce its nonlinearity for higher-level representations deeper in the network. In experiments with 1.1B and 3B parameter Llama models trained on 125B tokens of FineWeb Edu, xIELU achieves lower perplexity compared to popular activation functions like ReLU$^2$ and SwiGLU when matched for the same compute cost and parameter count. A reference implementation is available at https://github.com/Anonymous5823/xielu.
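The integration step described in the abstract can be illustrated with a minimal sketch. It assumes the gradient is an affine transform of ELU, namely `2 * alpha_p * x + beta` for positive inputs and `alpha_n * (exp(x) - 1) + beta` for non-positive inputs, and integrates it with the constant chosen so that the activation is zero at the origin. The parameter names `alpha_p`, `alpha_n`, `beta` and their default values are assumptions made for illustration; the abstract does not state them, and in the actual method these parameters are trainable rather than fixed scalars.

```python
import math


def xielu_grad(x, alpha_p=0.8, alpha_n=0.8, beta=0.5):
    """Assumed gradient: an affine transform of ELU (names are illustrative)."""
    if x > 0:
        # Linearly increasing gradient for positive inputs, as in ReLU^2.
        return 2.0 * alpha_p * x + beta
    # Scaled and shifted ELU branch; tends to beta - alpha_n as x -> -inf,
    # so it can become negative whenever alpha_n > beta.
    return alpha_n * math.expm1(x) + beta


def xielu(x, alpha_p=0.8, alpha_n=0.8, beta=0.5):
    """Integral of xielu_grad from 0 to x, so that xielu(0) == 0."""
    if x > 0:
        return alpha_p * x * x + beta * x
    # Integral of alpha_n * (e^t - 1) + beta is alpha_n * (e^x - 1 - x) + beta * x.
    return alpha_n * (math.expm1(x) - x) + beta * x


def central_difference(f, x, h=1e-6):
    """Numerical derivative, used only to sanity-check the integration."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


if __name__ == "__main__":
    for x in (-3.0, -0.5, 0.5, 2.0):
        print(x, xielu(x), xielu_grad(x), central_difference(xielu, x))
```

Under these assumptions, setting `alpha_n = 0` and `beta = 0` reduces the function to a scaled ReLU$^2$, which is consistent with the abstract's description of xIELU as an extension of ReLU$^2$ to negative inputs.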
