Stable and Robust Deep Learning By Hyperbolic Tangent Exponential Linear Unit (TeLU)

Alfredo Fernandez, Ankur Mali

TL;DR

The paper introduces TeLU, a novel activation function $f(x) = x \cdot \tanh(e^{x})$, to address vanishing and exploding gradients and promote stable, fast convergence. The authors provide theoretical analysis showing nonzero gradients, controlled positive growth, negative-side saturation, implicit regularization with zero-mean activations, Lipschitz continuity, and a smoother optimization landscape analyzed via the Fisher Information Matrix, with convergence guarantees under the Polyak-Łojasiewicz condition. Empirically, TeLU outperforms ReLU, GELU, Mish, and other activations across CNNs on CIFAR-10, CIFAR-100, and TinyImageNet, exhibiting lower variance and robust performance across optimizers. The work suggests TeLU as a potential new standard activation for deep networks, offering stability, efficiency, and improved convergence in diverse deep learning tasks.

Abstract

In this paper, we introduce the Hyperbolic Tangent Exponential Linear Unit (TeLU), a novel neural network activation function, represented as $f(x) = x \cdot \tanh(e^x)$. TeLU is designed to overcome the limitations of conventional activation functions like ReLU, GELU, and Mish by addressing the vanishing and, to an extent, the exploding gradient problems. Our theoretical analysis and empirical assessments reveal that TeLU outperforms existing activation functions in stability and robustness, effectively adjusting the mean of activation outputs towards zero for enhanced training stability and convergence. Extensive evaluations against popular activation functions (ReLU, GELU, SiLU, Mish, Logish, Smish) across advanced architectures, including ResNet-50, demonstrate TeLU's lower variance and superior performance, even under hyperparameter conditions optimized for other functions. In large-scale tests with challenging datasets like CIFAR-10, CIFAR-100, and TinyImageNet, encompassing 860 scenarios, TeLU consistently showcased its effectiveness, positioning itself as a potential new standard for neural network activation functions, boosting stability and performance in diverse deep learning applications.
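
As a concrete reading of the definition $f(x) = x \cdot \tanh(e^x)$, the following is a minimal scalar sketch using only the Python standard library. The function name `telu` and the cutoff of 20 are assumptions made for this illustration and are not taken from the paper; the cutoff only avoids floating-point overflow in `math.exp`, in a region where $\tanh(e^x)$ already rounds to 1.

```python
import math

def telu(x: float) -> float:
    """TeLU activation f(x) = x * tanh(exp(x)) for a scalar input."""
    # Assumed guard (not from the paper): for large x, math.exp would
    # overflow, but tanh(exp(x)) has already saturated to 1.0 in double
    # precision, so f(x) equals x to machine precision.
    if x > 20.0:
        return x
    return x * math.tanh(math.exp(x))

# Evaluate at a few sample points on both sides of zero.
for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"telu({x:+.1f}) = {telu(x):+.6f}")
```

For large positive inputs the sketch returns values close to $x$, and for large negative inputs the output shrinks towards zero, which is consistent with the controlled positive growth and negative-side saturation described above.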

Paper Structure (24 sections, 10 theorems, 32 equations, 10 figures, 48 tables)

This paper contains 24 sections, 10 theorems, 32 equations, 10 figures, 48 tables.

Key Result

Theorem 1

If $f(x) = x \cdot \tanh(e^x)$, then it avoids the vanishing gradient problem, since $f'(x) \neq 0$ for all $x \in \mathbb{R}$.
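
The derivative referenced in this theorem follows from the product and chain rules: $f'(x) = \tanh(e^x) + x \, e^x \left(1 - \tanh^2(e^x)\right)$. The sketch below evaluates that expression and compares it against a central finite difference, so the derivative can be inspected numerically at chosen points. The names `telu`, `telu_grad`, and `central_diff`, the cutoff of 20, and the step size `h` are assumptions for this illustration, not part of the paper.

```python
import math

def telu(x: float) -> float:
    """TeLU activation f(x) = x * tanh(exp(x)) for a scalar input."""
    if x > 20.0:  # assumed guard: tanh(exp(x)) is already 1.0 here
        return x
    return x * math.tanh(math.exp(x))

def telu_grad(x: float) -> float:
    """Analytic derivative: tanh(e^x) + x * e^x * (1 - tanh(e^x)^2)."""
    if x > 20.0:  # saturated region: slope is 1 to machine precision
        return 1.0
    u = math.exp(x)
    t = math.tanh(u)
    return t + x * u * (1.0 - t * t)

def central_diff(f, x: float, h: float = 1e-5) -> float:
    """Central finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

# Print the analytic and numeric derivative side by side.
for x in (-4.0, -2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  analytic={telu_grad(x):+.6f}  "
          f"numeric={central_diff(telu, x):+.6f}")
```

Agreement between the two columns is a quick sanity check on the closed-form derivative before using it in a larger analysis.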

Figures (10)

  • Figure 1: The characteristic of the TeLU activation function along with ReLU, GELU and Mish.
  • Figure 2: The first and second derivatives of the proposed TeLU activation compared to the derivatives of GELU and Mish.
  • Figure 3: ReLU Loss Landscape
  • Figure 4: TeLU Loss Landscape
  • Figure 5: Validation performance comparison of 7 activation functions per epoch on CIFAR-10 using SqueezeNet-SGD
  • ...and 5 more figures

Theorems & Definitions (10)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Theorem 6
  • Theorem 7
  • Theorem 8
  • Theorem 9
  • Theorem 10