A Significantly Better Class of Activation Functions Than ReLU Like Activation Functions
Mathew Mithra Noel, Yug Oswal
TL;DR
The paper asks whether activation functions beyond ReLU-like and sigmoidal forms can yield better decision boundaries. It introduces the Cone and Parabolic-Cone activations, defined by $g(z)=1-|z-1|$ and $g(z)=z(2-z)$, together with a parametric extension $g(z)=\beta-|z-\gamma|^{\alpha}$. With pre-activation $z = w^T x + b$, the set of inputs that produce a positive output is the hyperstrip $C_+ = \{x : 0 < w^T x + b < \delta\}$, bounded by the two parallel hyperplanes $z=0$ and $z=\delta$. Empirically, these activations achieve higher accuracies than ReLU or sigmoid on CIFAR-10 and Imagenette with significantly fewer neurons, and training is faster due to larger derivative values. This suggests that many nonlinear real-world datasets may be separated with fewer hyperstrips than half-spaces, motivating a shift toward hyperstrip-based decision boundaries in neural networks.
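The three formulas above can be written out directly. The following is a minimal sketch in plain Python, not the authors' implementation; the function names (`cone`, `parabolic_cone`, `parametric_cone`, `in_positive_region`) and the default parameter values are assumptions chosen for illustration. It also checks that, for these two specific definitions, the positive region is $0 < z < 2$, so the strip width is $\delta = 2$.

```python
def cone(z):
    """Cone activation: g(z) = 1 - |z - 1|."""
    return 1.0 - abs(z - 1.0)


def parabolic_cone(z):
    """Parabolic-Cone activation: g(z) = z * (2 - z)."""
    return z * (2.0 - z)


def parametric_cone(z, alpha=1.0, beta=1.0, gamma=1.0):
    """Parametric extension: g(z) = beta - |z - gamma| ** alpha.

    The defaults are illustrative: alpha = beta = gamma = 1 reproduces
    the Cone, and alpha = 2 with beta = gamma = 1 gives
    1 - (z - 1)^2 = z * (2 - z), the Parabolic-Cone.
    """
    return beta - abs(z - gamma) ** alpha


def pre_activation(w, x, b):
    """Affine pre-activation z = w^T x + b for plain Python sequences."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def in_positive_region(w, x, b, activation=cone):
    """True when x lies strictly inside the hyperstrip where g(z) > 0."""
    return activation(pre_activation(w, x, b)) > 0.0


if __name__ == "__main__":
    # Both activations are zero on the two bounding hyperplanes z = 0 and z = 2.
    for z in (-1.0, 0.0, 1.0, 2.0, 3.0):
        print(z, cone(z), parabolic_cone(z))
```

Unlike a ReLU neuron, whose positive region is the unbounded half-space $z > 0$, both functions turn negative again once $z$ exceeds 2, which is what bounds the positive region on both sides.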
Abstract
This paper introduces a significantly better class of activation functions than the almost universally used ReLU-like and sigmoidal classes of activation functions. Two new activation functions, referred to as the Cone and Parabolic-Cone, are proposed; they differ drastically from popular activation functions and significantly outperform them on the CIFAR-10 and Imagenette benchmarks. The Cone activation functions are positive only on a finite interval, zero at its end-points, and strictly negative everywhere outside it. Thus the set of inputs that produce a positive output for a neuron with a Cone activation function is a hyperstrip and not a half-space, as is usually the case. Since a hyperstrip is the region between two parallel hyperplanes, it allows neurons to divide the input feature space into positive and negative classes more finely than infinitely wide half-spaces do. In particular, the XOR function can be learned by a single neuron with a cone-like activation function. Both the Cone and Parabolic-Cone activation functions are shown to achieve higher accuracies with significantly fewer neurons on benchmarks. The results presented in this paper indicate that many nonlinear real-world datasets may be separated with fewer hyperstrips than half-spaces. The Cone and Parabolic-Cone activation functions have larger derivatives than ReLU and are shown to significantly speed up training.
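The XOR claim can be checked by hand. The sketch below assumes the Cone definition $g(z)=1-|z-1|$ and uses hand-picked weights $w=(1,1)$, $b=0$ rather than learned ones; these weights and the helper names `cone`, `neuron`, and `xor_table` are illustrative assumptions, not values reported in the paper. With this choice, $z = x_1 + x_2$ takes the values 0, 1, 1, 2 on the four XOR inputs, and the Cone maps them to 0, 1, 1, 0.

```python
def cone(z):
    """Cone activation: g(z) = 1 - |z - 1|."""
    return 1.0 - abs(z - 1.0)


def neuron(x, w=(1.0, 1.0), b=0.0):
    """Single neuron with hand-picked (not learned) weights and a Cone activation."""
    z = w[0] * x[0] + w[1] * x[1] + b
    return cone(z)


# Truth table for XOR: input pair -> target output.
xor_table = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

# Output of the single Cone neuron on each input.
outputs = {x: neuron(x) for x in xor_table}

if __name__ == "__main__":
    for x, target in xor_table.items():
        print(x, "neuron:", outputs[x], "target:", target)
```

The two inputs with target 1 lie on the line $x_1 + x_2 = 1$, inside the strip $0 < x_1 + x_2 < 2$, while the two inputs with target 0 lie exactly on its boundaries. A single neuron with a monotonic activation cannot do this, because its positive region is a half-space and no single line separates the two XOR classes.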
