On the Mathematical Relationship Between Layer Normalization and Dynamic Activation Functions

Felix Stollenwerk

TL;DR

The paper establishes a theoretical link between Layer Normalization and dynamic activations by deriving Dynamic Tanh (DyT) from RMSNorm via a derivative-space decoupling, and then obtaining an exact element-wise counterpart, Dynamic Inverse Square Root Unit (DyISRU), through a function-space decoupling. DyISRU yields the exact RMSNorm-like transformation $y_i = \sqrt{C}\cdot\frac{x_i}{\sqrt{\beta + x_i^2}}$, while DyT remains an approximation using a global inverse-variance cue; both share the same output bounds $\pm\sqrt{C}$. Outlier simulations show DyISRU more accurately reproduces normalization behavior on outliers than DyT, supporting its closer alignment with normalization. The work offers a theoretical basis for dynamic activation functions as normalization surrogates and provides public code for reproduction, though it does not include empirical experiments on model performance. Overall, the findings clarify when and how dynamic activations can emulate normalization, with DyISRU representing a principled exact alternative.

Abstract

Layer normalization (LN) is an essential component of modern neural networks. While many alternative techniques have been proposed, none of them have succeeded in replacing LN so far. The latest suggestion in this line of research is a dynamic activation function called Dynamic Tanh (DyT). Although it is empirically well-motivated and appealing from a practical point of view, it lacks a theoretical foundation. In this work, we shed light on the mathematical relationship between LN and dynamic activation functions. In particular, we derive DyT from the LN variant RMSNorm, and show that a well-defined decoupling in derivative space as well as an approximation are needed to do so. By applying the same decoupling procedure directly in function space, we are able to omit the approximation and obtain the exact element-wise counterpart of RMSNorm, which we call Dynamic Inverse Square Root Unit (DyISRU). We demonstrate numerically that DyISRU reproduces the normalization effect on outliers more accurately than DyT does.
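
To make the notion of an exact element-wise counterpart concrete, the following minimal sketch checks numerically that the DyISRU form quoted in the TL;DR coincides with RMSNorm when $\beta$ is set, for each element, to the sum of squares of all other elements. This coupled choice of $\beta$, the omission of a learnable gain and epsilon in `rms_norm`, the vector length `C = 8`, and all function names are assumptions made for illustration; they are not taken from the paper's code.

```python
import math
import random


def rms_norm(x):
    """RMSNorm without learnable gain or epsilon: x_i / sqrt(mean(x**2))."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [v / rms for v in x]


def dyisru(x_i, beta, C):
    """DyISRU as quoted in the TL;DR: sqrt(C) * x_i / sqrt(beta + x_i**2)."""
    return math.sqrt(C) * x_i / math.sqrt(beta + x_i * x_i)


random.seed(0)
C = 8  # hypothetical vector length, chosen only for this example
x = [random.gauss(0.0, 1.0) for _ in range(C)]

y_rms = rms_norm(x)

# Illustrative assumption: beta_i = sum of x_j**2 over all j != i.
# With this coupled choice, beta_i + x_i**2 equals the full squared norm.
sum_sq = sum(v * v for v in x)
y_coupled = [dyisru(v, sum_sq - v * v, C) for v in x]

# Largest element-wise deviation between the two outputs.
max_abs_diff = max(abs(a - b) for a, b in zip(y_rms, y_coupled))
print(max_abs_diff)
```

With a single constant $\beta$ in place of the per-element value, the same expression depends only on $x_i$, which is what makes it usable as an element-wise activation.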

Paper Structure

This paper contains 17 sections, 49 equations, 3 figures.

Figures (3)

  • Figure 1: Illustration of how to obtain the dynamic activation functions DyT (red) and DyISRU (blue) from RMSNorm (black). The labels T1, T2, T3 indicate the application of our theorems. The dashed line differentiates between function space ($y_i$) above and derivative space ($\frac{\partial y_i}{\partial x_j}$) below.
  • Figure 2: Functions DyT from Eq. (\ref{eq:dyt}) and DyISRU from Eq. (\ref{eq:dyisru}) with parameters $\alpha = 0.05$ and $\beta = 400$ such that the derivatives at $x=0$, namely $\alpha$ and $1/\sqrt{\beta}$, match. The dotted lines correspond to the extrema $y = \pm \sqrt{C}$.
  • Figure 3: Top: Stepwise outlier simulation. The sample $x$ is plotted against its normalized counterpart $y$, with outliers of different degrees (filled circles) as defined by Eq. (\ref{eq:outliers}). Center: Functions DyT and DyISRU with optimal parameters $\alpha$ and $\beta$, respectively, fitted on the outliers shown as colored, filled circles. The non-outlier data are shown as gray, empty circles. Bottom: Residuals of the functions DyT and DyISRU with respect to the outlier data. As the residuals are antisymmetric (like the data and the functions), only positive outliers are displayed for the sake of simplicity.
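
The parameter choice in the Figure 2 caption can be checked with a short sketch. It uses the DyISRU form quoted in the TL;DR and assumes the form $\sqrt{C}\tanh(\alpha x)$ for DyT, which is not spelled out in this summary and may differ from the paper's exact parameterization. Under that assumption both slopes at $x=0$ carry a common factor $\sqrt{C}$, so they match whenever $\alpha = 1/\sqrt{\beta}$, as with $\alpha = 0.05$ and $\beta = 400$. The value `C = 4` and the helper `slope_at_zero` are hypothetical.

```python
import math


def dyt(x, alpha, C):
    """ASSUMED DyT form: sqrt(C) * tanh(alpha * x); not confirmed by the summary."""
    return math.sqrt(C) * math.tanh(alpha * x)


def dyisru(x, beta, C):
    """DyISRU as quoted in the TL;DR: sqrt(C) * x / sqrt(beta + x**2)."""
    return math.sqrt(C) * x / math.sqrt(beta + x * x)


def slope_at_zero(f, h=1e-6):
    """Central finite-difference estimate of f'(0) (hypothetical helper)."""
    return (f(h) - f(-h)) / (2.0 * h)


alpha, beta = 0.05, 400.0  # values from the Figure 2 caption
C = 4                      # hypothetical dimension, for illustration only

slope_dyt = slope_at_zero(lambda x: dyt(x, alpha, C))
slope_dyisru = slope_at_zero(lambda x: dyisru(x, beta, C))

# Both functions should approach the same bound sqrt(C) for large inputs.
bound = math.sqrt(C)
far_dyt = dyt(1e6, alpha, C)
far_dyisru = dyisru(1e6, beta, C)

print(slope_dyt, slope_dyisru, far_dyt, far_dyisru, bound)
```

The two functions agree in slope at the origin and in their limits $\pm\sqrt{C}$, so any difference between them lies in how quickly they saturate in between.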

Theorems & Definitions (3)

  • proof
  • proof
  • proof