Stable and Robust Deep Learning By Hyperbolic Tangent Exponential Linear Unit (TeLU)
Alfredo Fernandez, Ankur Mali
TL;DR
TeLU introduces a novel activation $f(x) = x \cdot \tanh(e^{x})$ to address vanishing and exploding gradients and promote stable, fast convergence. The authors provide theoretical analysis showing nonzero gradients, controlled positive growth, negative-side saturation, implicit regularization with zero-mean activations, Lipschitz continuity, and a smoother optimization landscape via the Fisher Information Matrix, with convergence guarantees under the Polyak-Łojasiewicz condition. Empirically, TeLU outperforms ReLU, GELU, Mish, and other activations across CNNs on CIFAR-10, CIFAR-100, and TinyImageNet, exhibiting lower variance and robust performance across optimizers. The work suggests TeLU as a potential new standard for activations in deep networks, offering stability, efficiency, and improved convergence in diverse deep learning tasks.
Abstract
In this paper, we introduce the Hyperbolic Tangent Exponential Linear Unit (TeLU), a novel neural network activation function defined as $f(x) = x \cdot \tanh(e^{x})$. TeLU is designed to overcome the limitations of conventional activation functions such as ReLU, GELU, and Mish by addressing the vanishing and, to an extent, the exploding gradient problems. Our theoretical analysis and empirical assessments show that TeLU outperforms existing activation functions in stability and robustness, shifting the mean of activation outputs towards zero for improved training stability and convergence. Extensive evaluations against popular activation functions (ReLU, GELU, SiLU, Mish, Logish, Smish) across advanced architectures, including ResNet-50, demonstrate TeLU's lower variance and superior performance, even under hyperparameter settings optimized for other functions. In large-scale tests on challenging datasets such as CIFAR-10, CIFAR-100, and TinyImageNet, encompassing 860 scenarios, TeLU was consistently effective, positioning it as a potential new standard for neural network activation functions that improves stability and performance in diverse deep learning applications.
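To make the definition concrete, the following is a minimal NumPy sketch of $f(x) = x \cdot \tanh(e^{x})$ and of its derivative, $f'(x) = \tanh(e^{x}) + x\,e^{x}\,(1 - \tanh^{2}(e^{x}))$, obtained from the product and chain rules. It is not the authors' implementation: the names `telu` and `telu_grad` are illustrative, and capping the exponent at 20 is an assumption made here only to avoid floating-point overflow.

```python
import numpy as np

# Assumed cap on the exponent (not from the paper): tanh(e^20) already
# rounds to exactly 1.0 in float64, so the cap does not change results.
_EXP_CAP = 20.0


def telu(x):
    """TeLU activation, f(x) = x * tanh(exp(x)), applied elementwise."""
    x = np.asarray(x, dtype=np.float64)
    return x * np.tanh(np.exp(np.minimum(x, _EXP_CAP)))


def telu_grad(x):
    """Analytic derivative: tanh(e^x) + x * e^x * (1 - tanh(e^x)^2)."""
    x = np.asarray(x, dtype=np.float64)
    e = np.exp(np.minimum(x, _EXP_CAP))
    t = np.tanh(e)
    return t + x * e * (1.0 - t * t)


if __name__ == "__main__":
    xs = np.linspace(-5.0, 5.0, 11)
    for xi, fi, gi in zip(xs, telu(xs), telu_grad(xs)):
        print(f"x={xi:+.1f}  f(x)={fi:+.6f}  f'(x)={gi:+.6f}")
```

Running the script prints the function and its slope on a small grid. Large positive inputs pass through almost unchanged, while large negative inputs are mapped close to zero, which is consistent with the controlled positive growth and negative-side saturation described in the TL;DR.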
