Deriving Activation Functions Using Integration

Allen Hao Huang, Imanol Schlag

TL;DR

This work develops a gradient-centric paradigm for activation design: activation functions are derived by integrating trainable transformations of a chosen gradient. The resulting xIELU combines a linearly increasing gradient for positive inputs (like ReLU$^2$) with a trainable gradient for negative inputs that can take negative values (inspired by xSiLU), enabling adaptive nonlinearity across network depth; a companion xIPReLU variant offers a computationally lighter alternative. Empirical results on decoder-only Llama models (1.1B and 3B) trained on 125B tokens show that xIELU and xIPReLU achieve lower perplexities than ReLU$^2$ and SwiGLU at matched compute, with xIELU delivering the strongest gains and adaptive depth-wise behavior. Overall, the work demonstrates the promise of gradient-focused activation design for large-scale language modeling.

Abstract

Our work proposes a novel approach to designing activation functions by focusing on their gradients and deriving the corresponding activation functions using integration. We introduce the Expanded Integral of the Exponential Linear Unit (xIELU), a trainable piecewise activation function derived by integrating trainable affine transformations applied to the Exponential Linear Unit (ELU). xIELU combines two key properties for the gradient: (1) a trainable and linearly increasing gradient for positive inputs, similar to Squared ReLU (ReLU$^2$), and (2) a trainable gradient that can take negative values for negative inputs, inspired by Expanded SiLU (xSiLU). Conceptually, xIELU can be viewed as an extension of ReLU$^2$ to handle negative inputs. The trainable parameters in xIELU allow it to adaptively reduce its nonlinearity for higher-level representations deeper in the network. In experiments with 1.1B and 3B parameter Llama models trained on 125B tokens of FineWeb Edu, xIELU achieves lower perplexity compared to popular activation functions like ReLU$^2$ and SwiGLU when matched for the same compute cost and parameter count. A reference implementation is available at https://github.com/Anonymous5823/xielu.
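
The integration step described in the abstract can be sketched as a short worked derivation. This is a reconstruction under stated assumptions, not necessarily the paper's exact parameterization: the offsets $\beta_{p}$ and $\beta_{n}$, the factor of 2 on the positive slope, and the choice of integration constant (so that the activation is zero at the origin) are assumptions made here for illustration. Starting from ELU with unit scale, $\mathrm{ELU}(x) = x$ for $x > 0$ and $e^{x} - 1$ for $x \le 0$, apply a separate affine transformation on each side to obtain a target gradient, then integrate from 0:

$$
\begin{aligned}
g(x) &=
\begin{cases}
2\alpha_{p}\, x + \beta_{p}, & x > 0 \\
\alpha_{n}\,(e^{x} - 1) + \beta_{n}, & x \le 0
\end{cases}
\\[6pt]
f(x) = \int_{0}^{x} g(t)\, dt &=
\begin{cases}
\alpha_{p}\, x^{2} + \beta_{p}\, x, & x > 0 \\
\alpha_{n}\,(e^{x} - 1 - x) + \beta_{n}\, x, & x \le 0
\end{cases}
\end{aligned}
$$

For $x \le 0$ the term $e^{x} - 1$ lies in $(-1, 0]$, so the negative-side gradient lies in $(\beta_{n} - \alpha_{n}, \beta_{n}]$, which agrees with the range quoted in the Figure 2 caption. Under this sketch, the gradient is continuous at $x = 0$ when $\beta_{p} = \beta_{n}$, and with $\alpha_{n} = 0$ and $\beta_{p} = \beta_{n} = 0$ the positive side reduces to a scaled ReLU$^2$.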

Paper Structure

This paper contains 27 sections, 12 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Comparison of activation functions related to xIELU and their gradients. ELU is linearly increasing for positive inputs and is bounded below for negative inputs. ReLU$^2$ has a linearly increasing gradient for positive inputs and zero gradient for negative inputs. xSiLU introduces a trainable parameter $\alpha$ that controls the magnitude and range of negative-valued gradients by expanding the gradient limits of SiLU from $(0, 1)$ to $(-\alpha, 1+\alpha)$.
  • Figure 2: Visualization of xIELU and its gradients. The parameters $\alpha_{p}$ and $\alpha_{n}$ control the magnitude and range of the gradients. Larger values of either parameter increase the nonlinearity of xIELU. For the positive component, constraining $\alpha_{p}>0$ ensures a linearly increasing gradient. For the negative component, the gradient is bounded within the range $(\beta_{n}-\alpha_{n}, \beta_{n}]$ and constraining $\alpha_{n}>\beta_{n}$ ensures the presence of negative-valued gradients.
  • Figure 3: Perplexity comparison and parameter analysis of xIELU. (a) Perplexity for activation functions in 1.1B and 3B Llama models trained on 125B tokens. While xIELU initially shows higher perplexity, it progressively outperforms other functions as training continues. (b) Learned parameters $\alpha_{p}$ and $\alpha_{n}$ across normalized network depth (0 to 1). Both parameters decrease in deeper layers, suggesting xIELU adaptively reduces its nonlinearity for higher-level representations.
  • Figure 4: Adaptive behavior of xIELU across network depth in the 1.1B model.
  • Figure 5: Adaptive behavior of xIELU across network depth in the 3B model.
  • ...and 1 more figures
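
The gradient properties stated in the Figure 2 caption can be checked numerically with a minimal NumPy sketch. The closed form below is an assumption obtained by integrating an affine transform of ELU's two branches so that the activation is zero at the origin; the function names `xielu` and `xielu_grad`, the single shared offset `beta` (standing in for $\beta_{n}$ on both sides), and the default parameter values are hypothetical choices for illustration and are not taken from the reference implementation.

```python
import numpy as np


def xielu(x, alpha_p=0.8, alpha_n=0.8, beta=0.5):
    """Sketch of an xIELU-style activation (assumed closed form, see lead-in).

    x > 0 : alpha_p * x**2 + beta * x
    x <= 0: alpha_n * (exp(x) - 1 - x) + beta * x
    """
    x = np.asarray(x, dtype=float)
    # Clamp to the negative side so exp never overflows in the unused branch.
    xn = np.minimum(x, 0.0)
    pos = alpha_p * x**2 + beta * x
    neg = alpha_n * (np.expm1(xn) - xn) + beta * xn
    return np.where(x > 0, pos, neg)


def xielu_grad(x, alpha_p=0.8, alpha_n=0.8, beta=0.5):
    """Analytic gradient of the sketch above.

    x > 0 : 2 * alpha_p * x + beta        (linearly increasing if alpha_p > 0)
    x <= 0: alpha_n * (exp(x) - 1) + beta (lies in (beta - alpha_n, beta])
    """
    x = np.asarray(x, dtype=float)
    xn = np.minimum(x, 0.0)
    pos = 2.0 * alpha_p * x + beta
    neg = alpha_n * np.expm1(xn) + beta
    return np.where(x > 0, pos, neg)


if __name__ == "__main__":
    xs = np.linspace(-20.0, 0.0, 201)
    g = xielu_grad(xs)
    # With alpha_n > beta, some negative-side gradients are below zero.
    print("negative-side gradient range:", g.min(), g.max())
```

Choosing `alpha_n > beta` in this sketch yields negative-valued gradients for sufficiently negative inputs, while larger `alpha_p` or `alpha_n` makes the function more strongly nonlinear, matching the qualitative description in the caption.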