λ-GELU: Learning Gating Hardness for Controlled ReLU-ization in Deep Networks

Cristian Pérez-Corral, Alberto Fernández-Hernández, Jose I. Mestre, Manuel F. Dolz, Enrique S. Quintana-Ortí

Abstract

The Gaussian Error Linear Unit (GELU) is a widely used smooth alternative to the Rectified Linear Unit (ReLU), yet many deployment, compression, and analysis toolchains are most naturally expressed for piecewise-linear (ReLU-type) networks. We study a hardness-parameterized formulation of GELU, f(x;λ)=xΦ(λx), where Φ is the standard Gaussian CDF and λ ∈ [1, ∞) controls gate sharpness, with the goal of turning smooth gated training into a controlled path toward ReLU-compatible models. Learning λ is non-trivial: naive updates yield unstable dynamics and effective gradient attenuation, so we introduce a constrained reparameterization and an optimizer-aware update scheme. Empirically, across a diverse set of model-dataset pairs spanning MLPs, CNNs, and Transformers, we observe structured layerwise hardness profiles and assess their robustness under different initializations. We further study a deterministic ReLU-ization strategy in which the learned gates are progressively hardened toward a principled target, enabling a post-training substitution of λ-GELU by ReLU with reduced disruption. Overall, λ-GELU provides a minimal and interpretable knob to profile and control gating hardness, bridging smooth training with ReLU-centric downstream pipelines.
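
To make the role of λ concrete, the following is a minimal sketch of the activation f(x;λ)=xΦ(λx) as stated in the abstract, written in plain Python. It covers only the forward formula; the constrained reparameterization and the optimizer-aware update scheme are not described in this excerpt and are not reproduced here. The helper names (gaussian_cdf, lambda_gelu, max_gap_to_relu) and the evaluation grid are illustrative assumptions, not the authors' code.

```python
import math


def gaussian_cdf(z: float) -> float:
    """Standard normal CDF, Phi(z), written via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def lambda_gelu(x: float, lam: float = 1.0) -> float:
    """f(x; lam) = x * Phi(lam * x); lam = 1 recovers standard GELU."""
    if lam < 1.0:
        raise ValueError("the abstract restricts lam to [1, inf)")
    return x * gaussian_cdf(lam * x)


def relu(x: float) -> float:
    return max(0.0, x)


def max_gap_to_relu(lam: float, lo: float = -5.0, hi: float = 5.0,
                    steps: int = 1001) -> float:
    """Largest |f(x; lam) - ReLU(x)| on an evenly spaced grid.

    The grid bounds and resolution are assumptions chosen for illustration.
    """
    xs = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(abs(lambda_gelu(x, lam) - relu(x)) for x in xs)


if __name__ == "__main__":
    # Larger lam sharpens the gate, so the curve moves toward ReLU.
    for lam in (1.0, 2.0, 5.0, 20.0):
        print(f"lam={lam:5.1f}  max gap to ReLU = {max_gap_to_relu(lam):.4f}")
```

Substituting u=λx gives |f(x;λ) − ReLU(x)| = |u|Φ(−|u|)/λ, so the worst-case gap to ReLU over the real line shrinks in proportion to 1/λ. This is one way to read "hardening" the gate before a post-training swap to ReLU.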

Paper Structure

This paper contains 8 sections, 13 equations, 3 figures, 2 tables, 1 algorithm.

Figures (3)

  • Figure 1: Grid over temperature $t$ and the learning-rate multiplier $c$ applied to $s$, on MLP/FMNIST. Cell annotations report the mean validation-score change with respect to the GELU baseline (averaged over the three $s$ initializations), while the color encodes the mean hardness-drift proxy $\Delta\lambda$ (average absolute epoch-to-epoch change of the learned layerwise hardness). Lower temperatures and larger $c$ induce stronger hardness adaptation.
  • Figure 2: Sweep over the learning-rate multiplier $c$ applied to $s$, on ResNet-18/CIFAR-100 (AdamW, $t{=}0.1$ fixed). Cell annotations report the mean validation-score change with respect to the GELU baseline, $\Delta\mathrm{BVS}$ (averaged over the three $s$ initializations), while the color encodes the mean hardness-drift proxy $\Delta\lambda$ (average absolute epoch-to-epoch change of the learned layerwise hardness). $\Delta\lambda$ increases monotonically with $c$, while $\Delta\mathrm{BVS}$ remains small across the sweep; $c{=}1$ achieves the best mean $\Delta\mathrm{BVS}$, but $c{=}9$ is comparable and maximizes hardness adaptation.
  • Figure 3: Validation-metric trajectories (left) and Spearman rank correlations between layerwise hardness profiles across $s$-initialization modes (right). Vertical lines mark the epoch of best validation score (BVS).