Trainable Highly-expressive Activation Functions

Irit Chelly; Shahaf E. Finder; Shira Ifergane; Oren Freifeld

TL;DR

Fixed activation functions limit a network's expressiveness and can bias learning. The authors propose DiTAC, a trainable activation function built on CPAB transformations, a family of highly expressive diffeomorphisms. The GELU-like variant is defined as $\mathrm{DiTAC}(x)=\tilde{x}\cdot\Phi(x)$, where $\tilde{x}=T^{\theta}(x)$ on a user-defined interval $[a,b]$; further variants include Leaky-DiTAC and GE-DiTAC. Training is stabilized by a regularization term $\mathcal{L}_{\mathrm{reg}}$ on the CPA velocity fields, and the computational cost is reduced by a quantization/lookup-table scheme combined with a Straight-Through Estimator for gradients. Across toy tasks, real-world classification, semantic segmentation, and image generation, the authors report that DiTAC consistently outperforms both fixed and existing trainable activation functions with only a small parameter overhead. The code is publicly available.
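To make the definition above concrete, here is a minimal pure-Python sketch; it is not the authors' implementation. It assumes that $\Phi$ is the standard Gaussian CDF (as in GELU), that $\tilde{x}=x$ outside $[a,b]$, and that $T^{\theta}$ is obtained by integrating a continuous piecewise-affine (CPA) velocity field for one unit of time. The forward-Euler integrator, the nearest-neighbour lookup table, and every name and constant (`KNOTS`, `VALUES`, `build_lookup`, and so on) are illustrative assumptions; the Straight-Through Estimator and $\mathcal{L}_{\mathrm{reg}}$ are omitted.

```python
import math

# Hypothetical example parameters (not from the paper): interval [A, B] and a
# CPA velocity field given by its values at four knots. The velocity is zero
# at both endpoints, so the endpoints stay fixed under the transformation.
A, B = -3.0, 3.0
KNOTS = [-3.0, -1.0, 1.0, 3.0]
VALUES = [0.0, 0.5, -0.5, 0.0]


def cpa_velocity(x, knots, values):
    """Continuous piecewise-affine velocity: linear interpolation between knots."""
    if x <= knots[0]:
        return values[0]
    if x >= knots[-1]:
        return values[-1]
    for i in range(len(knots) - 1):
        if knots[i] <= x <= knots[i + 1]:
            t = (x - knots[i]) / (knots[i + 1] - knots[i])
            return (1.0 - t) * values[i] + t * values[i + 1]


def cpab_transform(x, knots, values, steps=200):
    """Approximate T(x) = phi(x, 1), where d(phi)/dt = v(phi), by forward Euler.

    Assumption: Euler integration is used here only for brevity.
    """
    h = 1.0 / steps
    for _ in range(steps):
        x += h * cpa_velocity(x, knots, values)
    return x


def gaussian_cdf(x):
    """Standard normal CDF, the Phi used in GELU."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def ditac(x, a, b, knots, values):
    """GELU-like DiTAC: x_tilde * Phi(x), with x_tilde = T(x) on [a, b].

    Assumption: inputs outside [a, b] are left untransformed.
    """
    x_tilde = cpab_transform(x, knots, values) if a <= x <= b else x
    return x_tilde * gaussian_cdf(x)


def build_lookup(a, b, knots, values, size=1024):
    """Precompute T on a uniform grid over [a, b] (quantization sketch)."""
    step = (b - a) / (size - 1)
    return [cpab_transform(a + i * step, knots, values) for i in range(size)]


def ditac_lookup(x, a, b, table):
    """DiTAC with T read from the nearest precomputed grid point."""
    if not (a <= x <= b):
        return x * gaussian_cdf(x)
    idx = round((x - a) / (b - a) * (len(table) - 1))
    return table[idx] * gaussian_cdf(x)


if __name__ == "__main__":
    table = build_lookup(A, B, KNOTS, VALUES)
    for x in (-2.0, -0.5, 0.0, 0.5, 2.0, 4.0):
        exact = ditac(x, A, B, KNOTS, VALUES)
        approx = ditac_lookup(x, A, B, table)
        print(f"x={x:+.1f}  ditac={exact:+.4f}  lookup={approx:+.4f}")
```

With an all-zero velocity field the transformation is the identity and the sketch reduces to GELU, which is a convenient sanity check. The lookup variant trades a small quantization error for evaluating the integrator only once per grid point rather than once per input.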

Abstract

Nonlinear activation functions are pivotal to the success of deep neural nets, and choosing the appropriate activation function can significantly affect their performance. Most networks use fixed activation functions (e.g., ReLU, GELU), and this choice might limit their expressiveness. Furthermore, different layers may benefit from diverse activation functions. Consequently, there has been a growing interest in trainable activation functions. In this paper, we introduce DiTAC, a trainable highly-expressive activation function based on an efficient diffeomorphic transformation (called CPAB). Despite introducing only a negligible number of trainable parameters, DiTAC enhances model expressiveness and performance, often yielding substantial improvements. It also outperforms existing activation functions (regardless of whether they are fixed or trainable) in tasks such as semantic segmentation, image generation, regression problems, and image classification. Our code is available at https://github.com/BGU-CS-VIL/DiTAC.

Paper Structure (36 sections, 11 equations, 13 figures, 10 tables)

Introduction
Related Work
Activation Functions
CPAB transformations in Deep Learning
Method
Preliminaries: 1D CPAB Transformations
The DiTAC Activation Function
How to Drastically Reduce the Computational Cost
Results
Toy Data
Classification
Regression
Real-World Data
Small-scale Classification Experiment
Classification
...and 21 more sections

Figures (13)

Figure 1: Regression-task results of reconstructing a two-dimensional function via a simple MLP, using either DiTAC or (the runner-up) PReLU. Due to its expressiveness, DiTAC manages to fit a smooth function, yielding an evidently better reconstruction.
Figure 2: The effect of the CPAB transformation when applied to each axis. In each panel we use a different $T^{\boldsymbol{\theta}}$. When the $x$ axis is transformed (i.e., $f(T^{\boldsymbol{\theta}}(x))$), the intensity values are unchanged and only the $x$ values are shifted. When the $y$ axis is transformed (i.e., $T^{\boldsymbol{\theta}}(f(x))$), the intensity changes while the locations of the peaks and valleys are preserved.
Figure 3: DiTAC's expressiveness reflected in a 3-node hidden-layer regression network. The first three rows match the three hidden nodes. Two left columns (blue): each node's value before and after the ReLU function. Two right columns (green): each node's value before and after a DiTAC function. Bottom row: learned 1D regression using ReLU (blue) versus using DiTAC (green).
Figure 4: Here we display (a) $\mathbf{v}^{\boldsymbol{\theta}}$, a CPA velocity field, (b) DiTAC (solid line) versus GELU (dashed line), and (c) a Leaky DiTAC (solid line) versus the identity function (dashed line). The two DiTAC functions were derived from the CPAB transformation $T^{\boldsymbol{\theta}}$ that corresponds to $\mathbf{v}^{\boldsymbol{\theta}}$ in (a).
Figure 5: Decision boundaries learned by DiTAC, Swish, and GELU in a 2D-GMM classification, along with test-batch data points colored by their ground-truth classes. Evidently, DiTAC learns more accurate boundaries and identifies more classes.
...and 8 more figures
