dynActivation: A Trainable Activation Family for Adaptive Nonlinearity

Alois Bachmann

Abstract

This paper proposes $\mathrm{dynActivation}$, a per-layer trainable activation defined as $f_i(x) = \mathrm{BaseAct}(x)(\alpha_i - \beta_i) + \beta_i x$, where $\alpha_i$ and $\beta_i$ are lightweight learned scalars that interpolate between the base nonlinearity and a linear path, and $\mathrm{BaseAct}(x)$ stands for any ReLU-like function. The static and dynamic ReLU-like variants are compared across multiple vision tasks, language modeling tasks, and ablation studies. The results suggest that dynActivation variants tend to linearize deep layers while maintaining high performance, which can improve training efficiency by up to $+54\%$ over ReLU. On CIFAR-10, dynActivation(Mish) improves over static Mish by up to $+14.02\%$ on AttentionCNN, with an average improvement of $+6.00\%$ and a $24\%$ reduction in convergence AUC relative to Mish (2120 vs. 2785). In a 1-to-75-layer MNIST depth-scaling study, dynActivation never drops below $95\%$ test accuracy ($95.3$--$99.3\%$), while ReLU collapses below $80\%$ at 25 layers. Under FGSM at $\varepsilon{=}0.08$, dynActivation(Mish) incurs a $55.39\%$ accuracy drop versus $62.79\%$ for ReLU (a $7.40$ percentage-point advantage). Transferred to language modeling, a newly proposed dynActGLU variant achieves a $10.3\%$ relative perplexity reduction over SwiGLU at 5620 steps (4.047 vs. 4.514), though the gap vanishes at 34300 steps.
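The definition in the abstract can be illustrated with a minimal scalar sketch in plain Python. It is not the paper's implementation: the function names, the choice of Mish and ReLU as base activations, and the example values of $\alpha$ and $\beta$ are assumptions made here for illustration, and in practice the two scalars would be trained per layer inside a deep-learning framework. The two helper derivatives follow directly from the formula: $\partial f / \partial \alpha = \mathrm{BaseAct}(x)$ and $\partial f / \partial \beta = x - \mathrm{BaseAct}(x)$.

```python
import math


def relu(x: float) -> float:
    """Plain ReLU, used here as one possible base activation."""
    return max(x, 0.0)


def mish(x: float) -> float:
    """Mish(x) = x * tanh(softplus(x)), with a numerically stable softplus."""
    softplus = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return x * math.tanh(softplus)


def dyn_activation(x: float, alpha: float, beta: float, base_act=mish) -> float:
    """f(x) = BaseAct(x) * (alpha - beta) + beta * x, as stated in the abstract.

    alpha and beta are passed in explicitly; in a real model they would be
    learned scalars, one pair per layer (assumption: no constraints on them).
    """
    return base_act(x) * (alpha - beta) + beta * x


def grad_alpha(x: float, base_act=mish) -> float:
    """Partial derivative of f with respect to alpha."""
    return base_act(x)


def grad_beta(x: float, base_act=mish) -> float:
    """Partial derivative of f with respect to beta."""
    return x - base_act(x)


if __name__ == "__main__":
    x = 1.5
    # alpha = 1, beta = 0 recovers the static base activation.
    print(dyn_activation(x, alpha=1.0, beta=0.0), mish(x))
    # alpha == beta removes the nonlinear term and leaves the linear path beta * x.
    print(dyn_activation(x, alpha=0.7, beta=0.7), 0.7 * x)
    # With ReLU as base, negative inputs pass through with slope beta.
    print(dyn_activation(-3.0, alpha=1.0, beta=0.25, base_act=relu))
```

Setting $\alpha = 1, \beta = 0$ gives back the base activation, while $\alpha = \beta$ collapses the layer to the linear map $\beta x$, which is one way to read the abstract's remark that deep layers tend to linearize.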

Paper Structure (46 sections, 4 equations, 16 figures, 10 tables)

Introduction
Motivation
Why trainable activations
Local intuition figure pair
dynActivation
Derivation
Parameter interpretation
Gradient and parameter analysis
Gradient-flow / Lipschitz figure pair
Experimental Setup
Protocol
Architectures and datasets
CIFAR Validation
Environment Setup
...and 31 more sections

Figures (16)

Figure 1: Motivational view of trainable activations.
Figure 2: Schematic view of the dynActivation family and its shape deformation relative to the base activation.
Figure 3: Gradient flow and Lipschitz behavior of dynActivation.
Figure 4: Accuracy-versus-Loss trade-off.
Figure 5: MNIST depth scaling figure pair.
...and 11 more figures
