A Significantly Better Class of Activation Functions Than ReLU Like Activation Functions

Mathew Mithra Noel; Yug Oswal

TL;DR

The paper asks whether activation functions outside the ReLU-like and sigmoidal families can yield better decision boundaries. It introduces the Cone and Parabolic-Cone activations, defined by $g(z)=1-|z-1|$ and $g(z)=z(2-z)$, and a parametric extension $g(z)=\beta-|z-\gamma|^{\alpha}$. For a neuron with pre-activation $z = w^T x + b$, these activations are positive only on a hyper-strip $C_+ = \{x : 0 < w^T x + b < \delta\}$ bounded by the two parallel hyperplanes $z=0$ and $z=\delta$ (with $\delta = 2$ for the Cone and Parabolic-Cone). Empirically, these activations achieve higher accuracies than ReLU or sigmoid on CIFAR-10 and Imagenette with significantly fewer neurons, and training is faster, which the paper attributes to their larger derivative values. This suggests that many nonlinear real-world datasets may be separated with fewer hyper-strips than half-spaces, motivating a shift toward hyper-strip based decision boundaries in neural networks.
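The three definitions above can be written down directly. The following is a minimal sketch in plain Python, not the authors' implementation; the function names and the default parameter values ($\alpha=\beta=\gamma=1$, which reduce the parametric form to the Cone) are assumptions made for illustration.

```python
def cone(z):
    """Cone activation: g(z) = 1 - |z - 1|, positive only for 0 < z < 2."""
    return 1.0 - abs(z - 1.0)


def parabolic_cone(z):
    """Parabolic-Cone activation: g(z) = z * (2 - z), positive only for 0 < z < 2."""
    return z * (2.0 - z)


def parameterized_cone(z, alpha=1.0, beta=1.0, gamma=1.0):
    """Parametric extension: g(z) = beta - |z - gamma| ** alpha.

    The defaults are illustrative assumptions; they reproduce the Cone.
    """
    return beta - abs(z - gamma) ** alpha


def relu(z):
    """ReLU, for comparison: positive on the whole half-line z > 0."""
    return max(0.0, z)


# Tabulate the activations on a few pre-activation values.
for z in (-1.0, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0):
    print(z, cone(z), parabolic_cone(z), relu(z))
```

Both Cone variants are zero at $z=0$ and $z=2$ and negative outside that interval, whereas ReLU stays positive for every $z>0$. Setting $\alpha=2$, $\beta=1$, $\gamma=1$ in the parametric form gives $1-(z-1)^2 = z(2-z)$, which is the Parabolic-Cone.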

Abstract

This paper introduces a significantly better class of activation functions than the almost universally used ReLU-like and sigmoidal classes of activation functions. Two new activation functions, referred to as the Cone and Parabolic-Cone, are proposed; they differ drastically from popular activation functions and significantly outperform them on the CIFAR-10 and Imagenette benchmarks. The cone activation functions are positive only on a finite interval and strictly negative outside it, becoming zero at the end-points of the interval. Thus the set of inputs that produce a positive output for a neuron with cone activation functions is a hyper-strip and not a half-space, as is usually the case. Since a hyper-strip is the region between two parallel hyperplanes, it allows neurons to divide the input feature space into positive and negative classes more finely than is possible with infinitely wide half-spaces. In particular, the XOR function can be learned by a single neuron with cone-like activation functions. Both the Cone and Parabolic-Cone activation functions are shown to achieve higher accuracies with significantly fewer neurons on benchmarks. The results presented in this paper indicate that many nonlinear real-world datasets may be separated with fewer hyper-strips than half-spaces. The Cone and Parabolic-Cone activation functions have larger derivatives than ReLU and are shown to significantly speed up training.
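The XOR claim can be checked by hand. The sketch below uses one Cone neuron with hand-picked weights $w=(2,2)$ and bias $b=-1$; these values are an assumption chosen for illustration (they are not learned and are not taken from the paper), and the helper names are hypothetical. With them, the pre-activation $z = w^T x + b$ equals $-1$, $1$, $1$, $3$ on the four XOR inputs, so only the two mixed inputs fall inside the hyper-strip $0 < z < 2$.

```python
def cone(z):
    """Cone activation: g(z) = 1 - |z - 1|."""
    return 1.0 - abs(z - 1.0)


def cone_neuron(x1, x2, w=(2.0, 2.0), b=-1.0):
    """Single neuron with Cone activation; w and b are hand-picked, not trained."""
    z = w[0] * x1 + w[1] * x2 + b
    return cone(z)


def predict_xor(x1, x2):
    """Class 1 if the neuron output is positive, i.e. the input lies in the hyper-strip."""
    return 1 if cone_neuron(x1, x2) > 0 else 0


# Evaluate the neuron on all four binary inputs.
xor_table = {(a, b): predict_xor(a, b) for a in (0, 1) for b in (0, 1)}
print(xor_table)
```

A single ReLU or sigmoidal neuron cannot do this, because its positive region is a half-space and no single hyperplane separates the inputs $(0,1)$ and $(1,0)$ from $(0,0)$ and $(1,1)$.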


Paper Structure (5 sections, 5 equations, 10 figures, 7 tables)

Introduction
Nature of neuronal decision boundaries
Halfspaces versus Hyper-strips
Results: Performance comparison on benchmark datasets
Conclusion

Figures (10)

Figure 1: Comparison of ReLU with the Cone and Parabolic-Cone activation functions. The set of inputs that produce a strictly positive output for the Cone and Parabolic-Cone activation functions is the finite interval $(0,2)$, as opposed to $(0,\infty)$ for ReLU.
Figure 2: A Comparison of the first derivatives of different activation functions. Cone-like activation functions never saturate and have larger derivative values for most inputs.
Figure 3: Variation in the shape of the Parameterized-Cone activation with parameter $\beta$.
Figure 4: Only two hyper-strips are needed to accurately partition this dataset. Two neurons with Cone or Parabolic-Cone activations can learn the two hyper-strips, whereas four ReLU or sigmoidal neurons would be needed to learn the four hyperplane boundaries.
Figure 5: The classic XOR problem can be solved with a single neuron with Cone activation, since $C_+$ is a hyper-strip for Cone-like neurons.
...and 5 more figures
