Swish-T : Enhancing Swish Activation with Tanh Bias for Improved Neural Network Performance

Youngmin Seo; Jinha Kim; Unsang Park

TL;DR

Swish-T introduces an adaptive tanh bias to the Swish activation, defined as $f(x; \beta, \alpha) = x \sigma(\beta x) + \alpha \tanh(x)$, and develops three variants (Swish-T_A, Swish-T_B, Swish-T_C) with distinct efficiency, adaptability, and stability properties. The authors derive gradients for backpropagation and demonstrate superior or competitive performance across MNIST, Fashion-MNIST, SVHN, CIFAR-10, and CIFAR-100, with Swish-T_C often achieving the best results. Ablation studies show that non-parametric versions (fixed beta) can still deliver high accuracy and faster training, highlighting practical applicability. Overall, Swish-T offers a robust, versatile activation family that improves upon Swish and several baselines while maintaining efficient training dynamics.
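The definition above can be sketched in a few lines of plain Python. This is a minimal illustration of the base formula $f(x; \beta, \alpha) = x \sigma(\beta x) + \alpha \tanh(x)$ only, not the authors' implementation and not the Swish-T_A/B/C variants. The function names `swish_t` and `swish_t_grad` and the default values `beta=1.0`, `alpha=0.1` are assumptions chosen for illustration. The derivative is obtained here by the product and chain rules and may be written differently in the paper.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swish_t(x: float, beta: float = 1.0, alpha: float = 0.1) -> float:
    """Base Swish-T: x * sigmoid(beta * x) + alpha * tanh(x).

    The defaults for beta and alpha are illustrative assumptions.
    """
    return x * sigmoid(beta * x) + alpha * math.tanh(x)


def swish_t_grad(x: float, beta: float = 1.0, alpha: float = 0.1) -> float:
    """Derivative of swish_t with respect to x.

    d/dx [x * s(beta x)] = s + beta * x * s * (1 - s), with s = sigmoid(beta x)
    d/dx [alpha * tanh(x)] = alpha * (1 - tanh(x)^2)
    """
    s = sigmoid(beta * x)
    return s + beta * x * s * (1.0 - s) + alpha * (1.0 - math.tanh(x) ** 2)


if __name__ == "__main__":
    # Print a few sample values of the activation and its slope.
    for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(f"x={x:+.1f}  f={swish_t(x):+.4f}  f'={swish_t_grad(x):+.4f}")
```

Setting `alpha=0.0` recovers plain Swish, which makes it easy to compare the two functions on the same inputs.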

Abstract

We propose the Swish-T family, an enhancement of the existing non-monotonic activation function Swish. Swish-T is defined by adding a Tanh bias to the original Swish function. This modification creates a family of Swish-T variants, each designed to excel in different tasks, showcasing specific advantages depending on the application context. The Tanh bias allows for broader acceptance of negative values during initial training stages, offering a smoother non-monotonic curve than the original Swish. We ultimately propose the Swish-T$_{\textbf{C}}$ function, while Swish-T and Swish-T$_{\textbf{B}}$, byproducts of Swish-T$_{\textbf{C}}$, also demonstrate satisfactory performance. Furthermore, our ablation study shows that using Swish-T$_{\textbf{C}}$ as a non-parametric function can still achieve high performance. The superiority of the Swish-T family has been empirically demonstrated across various models and benchmark datasets, including MNIST, Fashion MNIST, SVHN, CIFAR-10, and CIFAR-100. The code is publicly available at https://github.com/ictseoyoungmin/Swish-T-pytorch.
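One way to read the claim about negative values is to look at the tails of the function. The sketch below uses only the definition $f(x; \beta, \alpha) = x \sigma(\beta x) + \alpha \tanh(x)$ and assumes $\beta > 0$ and $\alpha > 0$ (sign assumptions made here for illustration, not stated in the abstract):

```latex
\begin{aligned}
\lim_{x \to -\infty} f(x; \beta, \alpha)
  &= \underbrace{\lim_{x \to -\infty} x\,\sigma(\beta x)}_{=\,0}
   + \alpha \lim_{x \to -\infty} \tanh(x) = -\alpha, \\
f(x; \beta, \alpha) &\approx x + \alpha \quad \text{as } x \to +\infty, \\
f(0; \beta, \alpha) &= 0, \qquad
f'(0; \beta, \alpha) = \tfrac{1}{2} + \alpha .
\end{aligned}
```

Under these assumptions, plain Swish decays to 0 for large negative inputs, whereas the base Swish-T settles at $-\alpha$, so strongly negative pre-activations still produce a nonzero output.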

Paper Structure (11 sections, 11 equations, 4 figures, 5 tables)

Introduction
Related work and Motivation
Swish-T
Swish-T Family
Gradient Computation for Swish-T Family
Experiments and Results
MNIST, Fashion MNIST and SVHN
CIFAR
Ablation Study
Effect of Beta Fixation on ResNet-18 with Swish-T Family
Conclusion

Figures (4)

Figure 1: Comparison of Various Activation Functions including Sigmoid, ReLU, Leaky ReLU, GELU, Swish, Mish, SMU, SMU-1, Swish-T$_{\textbf{C}}$, and Identity in the Output Landscape of a 3-Layer Neural Network. All the networks' weights are randomly initialized, and no training has been performed, showcasing the initial output patterns induced by each activation function.
Figure 2: Swish-T$_{\textbf{C}}$, Swish activation function and first derivatives. (a) Swish-T$_{\textbf{C}}$ activation function with fixed $\alpha$ and $\beta$. (b) The first derivatives with fixed $\alpha = 0.5$ and different values of $\beta$. $\beta$ controls how quickly the first derivative reaches the upper/lower asymptotes. (c) $\alpha$ determines the upper/lower bounds of the first derivative.
Figure 3: Train and test curves for ShuffleNetv2 (2.x) on the CIFAR100 dataset. This figure shows the comparison of the performance metrics (Top-1 accuracy and loss) between the Swish-T family and other activation functions. The shaded areas represent the standard deviation.
Figure 4: Average training time for SENet-18 and DenseNet-121 on CIFAR-10 using a single GPU. (Performance metrics can be found in Table \ref{tab:cifar10}.)
