VeLU: Variance-enhanced Learning Unit for Deep Neural Networks

Ashkan Shakarami, Yousef Yeganeh, Azade Farshad, Lorenzo Nicolè, Stefano Ghidoni, Nassir Navab

TL;DR

VeLU introduces a variance-aware activation by coupling ArcTan-ArcSin nonlinearities with variance-adaptive scaling and Wasserstein-2 regularization to modulate activations based on local statistics without adding trainable layers. This design directly mitigates internal covariate shift at the activation level, improving gradient flow, training stability, and generalization across CNNs and ViTs. Extensive experiments across six architectures and 12 vision benchmarks show consistent improvements over ReLU, Swish, and GELU, with robust performance across resolutions and optimizers and only a single learnable parameter. The work provides practical guidelines for parameter ranges and outlines limitations and future directions. A public implementation is available on GitHub, positioning VeLU as a lightweight, architecture-agnostic activation alternative.

Abstract

Activation functions play a critical role in deep neural networks by shaping gradient flow, optimization stability, and generalization. While ReLU remains widely used due to its simplicity, it suffers from gradient sparsity and dead-neuron issues and offers no adaptivity to input statistics. Smooth alternatives such as Swish and GELU improve gradient propagation but still apply a fixed transformation regardless of the activation distribution. In this paper, we propose VeLU, a Variance-enhanced Learning Unit that introduces variance-aware and distributionally aligned nonlinearity through a principled combination of ArcTan-ArcSin transformations, adaptive scaling, and Wasserstein-2 regularization (Optimal Transport). This design enables VeLU to modulate its response based on local activation variance, mitigate internal covariate shift at the activation level, and improve training stability without adding learnable parameters or architectural overhead. Extensive experiments across six deep neural networks show that VeLU outperforms ReLU, ReLU6, Swish, and GELU on 12 vision benchmarks. The implementation of VeLU is publicly available on GitHub.
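
The abstract names Wasserstein-2 regularization but this summary does not give the paper's exact formulation. As a rough illustration only, the sketch below uses the standard closed form for the Wasserstein-2 distance between two one-dimensional Gaussians, W2 = sqrt((mu_a - mu_b)^2 + (sigma_a - sigma_b)^2), to measure how far a batch of activations sits from a standard normal reference. The Gaussian approximation, the standard normal target, and the function names (`w2_gaussian_1d`, `moments`, `w2_to_standard_normal`) are assumptions made for this example, not the authors' implementation.

```python
import math


def w2_gaussian_1d(mu_a, sigma_a, mu_b, sigma_b):
    """Closed-form Wasserstein-2 distance between two 1-D Gaussians.

    For N(mu_a, sigma_a^2) and N(mu_b, sigma_b^2) the distance is
    sqrt((mu_a - mu_b)^2 + (sigma_a - sigma_b)^2).
    """
    return math.sqrt((mu_a - mu_b) ** 2 + (sigma_a - sigma_b) ** 2)


def moments(values):
    """Return the mean and (population) standard deviation of a list."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, math.sqrt(variance)


def w2_to_standard_normal(values):
    """Hypothetical helper: treat the activations as Gaussian (an assumption)
    and measure their W2 distance to the reference N(0, 1)."""
    mean, std = moments(values)
    return w2_gaussian_1d(mean, std, 0.0, 1.0)


if __name__ == "__main__":
    activations = [-1.0, 1.0]  # toy batch with mean 0 and std 1
    print(w2_to_standard_normal(activations))
```

A term of this kind penalizes both a shifted mean and a mismatched spread, which is one plausible reading of "distributionally aligned" in the abstract; how VeLU actually incorporates the regularizer should be checked against the paper and its released code.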

Paper Structure

This paper contains 33 sections, 13 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Comparison of VeLU with standard activation functions (ReLU, Swish, GELU) in the range $[-1.5, 1.5]$. The plot highlights smoothness, non-linearity, and the adaptive behavior introduced by VeLU.
  • Figure 2: Behavior of VeLU under different parameter configurations. The curves illustrate how curvature, smoothness, and variance-adaptive modulation emerge from the ArcTan-ArcSin composite, scaling term, and Wasserstein alignment.
  • Figure 3: Histogram of preactivation values from a hidden layer in ResNet50, before and after training with VeLU. The post-training distribution is more concentrated and symmetric, reflecting improved activation stability.
  • Figure 4: Output landscapes of a randomly initialized ResNet50 with different activation functions. VeLU yields a smoother, variance-adaptive landscape compared to ReLU, Swish, and GELU.
  • Figure 5: Comparison of VeLU with ReLU, Swish, and GELU in ResNet50. (Left) Validation accuracy: VeLU surpasses all baselines after only a few epochs and keeps a consistent advantage throughout training. (Right) Validation loss: VeLU shows a steep early drop and converges to the lowest loss, whereas the baselines plateau earlier and ReLU exhibits a mild late increase.