Swish-T : Enhancing Swish Activation with Tanh Bias for Improved Neural Network Performance
Youngmin Seo, Jinha Kim, Unsang Park
TL;DR
Swish-T introduces an adaptive tanh bias to the Swish activation, defined as $f(x; \beta, \alpha) = x \sigma(\beta x) + \alpha \tanh(x)$, and develops three variants (Swish-T$_{\textbf{A}}$, Swish-T$_{\textbf{B}}$, Swish-T$_{\textbf{C}}$) that differ in efficiency, adaptability, and stability. The authors derive the gradients needed for backpropagation and report superior or competitive accuracy on MNIST, Fashion-MNIST, SVHN, CIFAR-10, and CIFAR-100, with Swish-T$_{\textbf{C}}$ often achieving the best results. Ablation studies show that non-parametric versions (fixed $\beta$) can still deliver high accuracy while training faster, which makes them practical when a learnable parameter is unwanted. Overall, Swish-T is a versatile activation family that improves on Swish and several baselines while keeping training efficient.
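The definition $f(x; \beta, \alpha) = x \sigma(\beta x) + \alpha \tanh(x)$ and its derivative can be sketched in a few lines of plain Python. This is a minimal scalar illustration, not the authors' implementation: the function names `sigmoid`, `swish_t`, and `swish_t_grad` are hypothetical, the default values of `beta` and `alpha` are placeholders chosen for the example, and the derivative is obtained here by applying the product and chain rules to the formula above rather than copied from the paper.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swish_t(x: float, beta: float = 1.0, alpha: float = 0.1) -> float:
    """Swish-T as defined above: x * sigmoid(beta * x) + alpha * tanh(x).

    The defaults for beta and alpha are illustrative assumptions.
    """
    return x * sigmoid(beta * x) + alpha * math.tanh(x)


def swish_t_grad(x: float, beta: float = 1.0, alpha: float = 0.1) -> float:
    """Derivative of swish_t with respect to x.

    d/dx [x * s(beta x)] = s + beta * x * s * (1 - s), with s = sigmoid(beta x)
    d/dx [alpha * tanh(x)] = alpha * (1 - tanh(x)^2)
    """
    s = sigmoid(beta * x)
    t = math.tanh(x)
    return s + beta * x * s * (1.0 - s) + alpha * (1.0 - t * t)


if __name__ == "__main__":
    # Compare the analytic derivative with a central finite difference.
    h = 1e-5
    for x in (-3.0, -0.5, 0.0, 0.7, 4.0):
        numeric = (swish_t(x + h) - swish_t(x - h)) / (2.0 * h)
        print(x, swish_t(x), swish_t_grad(x), numeric)
```

Setting `alpha=0.0` recovers plain Swish, which is a quick way to check that the tanh term is the only difference between the two functions.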
Abstract
We propose the Swish-T family, an enhancement of the existing non-monotonic activation function Swish. Swish-T is defined by adding a Tanh bias to the original Swish function. This modification yields a family of Swish-T variants, each offering specific advantages depending on the application context. The Tanh bias allows for broader acceptance of negative values during initial training stages, offering a smoother non-monotonic curve than the original Swish. We ultimately propose the Swish-T$_{\textbf{C}}$ function, while Swish-T and Swish-T$_{\textbf{B}}$, byproducts of Swish-T$_{\textbf{C}}$, also demonstrate satisfactory performance. Furthermore, our ablation study shows that using Swish-T$_{\textbf{C}}$ as a non-parametric function can still achieve high performance. The superiority of the Swish-T family has been empirically demonstrated across various models and benchmark datasets, including MNIST, Fashion-MNIST, SVHN, CIFAR-10, and CIFAR-100. The code is publicly available at https://github.com/ictseoyoungmin/Swish-T-pytorch.
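One way to read the claim about negative values is to look at the tails of the base definition $f(x; \beta, \alpha) = x \sigma(\beta x) + \alpha \tanh(x)$. The short derivation below is an illustrative sketch that assumes $\beta > 0$; it is not taken from the paper.

$$
\lim_{x \to -\infty} f(x; \beta, \alpha)
= \underbrace{\lim_{x \to -\infty} x \sigma(\beta x)}_{=\,0}
+ \alpha \underbrace{\lim_{x \to -\infty} \tanh(x)}_{=\,-1}
= -\alpha,
\qquad
\lim_{x \to +\infty} \bigl( f(x; \beta, \alpha) - x \bigr) = \alpha .
$$

Plain Swish ($\alpha = 0$) decays to $0$ for large negative inputs, whereas with $\alpha > 0$ the output settles at $-\alpha$, so negative pre-activations keep a nonzero negative response. For large positive inputs the function behaves like $x + \alpha$, a shift of the identity by the same constant.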
