
From Dead Neurons to Deep Approximators: Deep Bernstein Networks as a Provable Alternative to Residual Layers

Ibrahim Albool, Malak Gamal El-Din, Salma Elmalaki, Yasser Shoukry

TL;DR

The paper tackles vanishing gradients and dead neurons in deep networks by replacing standard activations with learnable Bernstein polynomials, formulating DeepBern-Nets as residual-free architectures. It establishes a lower bound on the derivatives of Bernstein activations, which ensures gradient persistence, and uses an architectural re-parameterization plus batch normalization to stabilize training. It further proves that approximation error decays exponentially with depth, $\|\mathcal{N}-f\|_\infty \le C_d \cdot \omega_f(1/n^L)$, improving on the polynomial rates of ReLU-based networks, and corroborates these claims with experiments on HIGGS and MNIST that show strong performance without skip connections. Collectively, the work offers a principled path toward deep, residual-free networks with enhanced expressive capacity and training stability, potentially enabling more efficient and scalable models. The results imply that Bernstein activations can deliver both high representational density and trainability in deep architectures.
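
As a rough illustration of how this bound scales with depth, assume (this is an added assumption, not part of the summary) that $f$ is $K$-Lipschitz, so its modulus of continuity satisfies $\omega_f(\delta) \le K\delta$. The constant $C_d$ is left unspecified, as in the summary. Under that assumption the bound reads:

```latex
\|\mathcal{N}-f\|_\infty \;\le\; C_d \,\omega_f\!\left(n^{-L}\right) \;\le\; C_d\, K\, n^{-L},
\qquad
n = 9:\quad n^{-1} = \tfrac{1}{9},\quad n^{-2} = \tfrac{1}{81},\quad n^{-3} = \tfrac{1}{729}.
```

In this setting each additional layer shrinks the bound by a factor of $n$, which is the sense in which the rate is exponential in the depth $L$.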

Abstract

Residual connections are the de facto standard for mitigating vanishing gradients, yet they impose structural constraints and fail to address the inherent inefficiencies of piecewise linear activations. We show that Deep Bernstein Networks (which utilize Bernstein polynomials as activation functions) can act as a residual-free architecture while simultaneously optimizing trainability and representation power. We provide a two-fold theoretical foundation for our approach. First, we derive a theoretical lower bound on the local derivative, proving it remains strictly bounded away from zero. This directly addresses the root cause of gradient stagnation; empirically, our architecture reduces "dead" neurons from 90% in standard deep networks to less than 5%, outperforming ReLU, Leaky ReLU, SELU, and GELU. Second, we establish that the approximation error for Bernstein-based networks decays exponentially with depth, a significant improvement over the polynomial rates of ReLU-based architectures. By unifying these results, we demonstrate that Bernstein activations provide a superior mechanism for function approximation and signal flow. Our experiments on HIGGS and MNIST confirm that Deep Bernstein Networks achieve high-performance training without skip connections, offering a principled path toward deep, residual-free architectures with enhanced expressive capacity.
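
To make the activation concrete, the following is a minimal sketch of a degree-$n$ Bernstein polynomial on an interval $[l, u]$ and its derivative, written in plain Python. It assumes the standard Bernstein form $\sum_k c_k \binom{n}{k} t^k (1-t)^{n-k}$ with $t = (x-l)/(u-l)$; the function names, the interval defaults, and the coefficient values are illustrative and not taken from the paper. The derivative is a convex combination of the scaled coefficient gaps $n(c_{k+1}-c_k)/(u-l)$, so if every gap is at least some $\delta > 0$, the derivative on $[l, u]$ is at least $n\delta/(u-l)$. This is a textbook property and may differ from the exact bound $\bar{\delta}$ proved in the paper.

```python
from math import comb


def bernstein_activation(x, coeffs, l=-1.0, u=1.0):
    """Evaluate sum_k c_k * C(n,k) * t^k * (1-t)^(n-k), with t = (x-l)/(u-l).

    Assumes the standard Bernstein basis on [l, u]; names are illustrative.
    """
    n = len(coeffs) - 1
    t = (x - l) / (u - l)
    return sum(
        c * comb(n, k) * t**k * (1 - t) ** (n - k) for k, c in enumerate(coeffs)
    )


def bernstein_derivative(x, coeffs, l=-1.0, u=1.0):
    """Derivative w.r.t. x: n/(u-l) * sum_k (c_{k+1}-c_k) * B_{n-1,k}(t)."""
    n = len(coeffs) - 1
    t = (x - l) / (u - l)
    gaps = [coeffs[k + 1] - coeffs[k] for k in range(n)]
    return (n / (u - l)) * sum(
        g * comb(n - 1, k) * t**k * (1 - t) ** (n - 1 - k)
        for k, g in enumerate(gaps)
    )


# Illustrative coefficients with every consecutive gap >= delta (hypothetical values).
delta = 0.01
coeffs = [0.0, 0.05, 0.5, 0.51]
n = len(coeffs) - 1
lower_bound = n * delta / 2.0  # n * delta / (u - l) with [l, u] = [-1, 1]

# Smallest derivative observed on a grid over [-1, 1].
grid = [-1.0 + 2.0 * i / 100 for i in range(101)]
min_derivative = min(bernstein_derivative(x, coeffs) for x in grid)
```

On this grid `min_derivative` should not fall below `lower_bound` (up to floating-point error), which illustrates how keeping the coefficients strictly increasing prevents the activation's slope from collapsing to zero inside $[l, u]$.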

Paper Structure

This paper contains 36 sections, 9 theorems, 28 equations, 11 figures, 3 tables, 1 algorithm.

Key Result

Proposition 2.1

Consider the Bernstein activation function $\sigma(x;l,u,\boldsymbol{c})$ of arbitrary order $n$. The following holds:

Figures (11)

  • Figure 1: Minimum absolute derivative across training epochs at layers 1 (left), 25 (middle), and 50 (right) on the HIGGS dataset (top) and the MNIST dataset (bottom). The theoretical lower bound $\bar{\delta}$ is shown for Bernstein polynomial activations.
  • Figure 2: Minimum absolute derivative across network depth at the final training epoch. The theoretical lower bound $\bar{\delta}$ is shown for Bernstein polynomial activations. (Left) HIGGS dataset and (Right) MNIST dataset.
  • Figure 3: Dead neuron ratio comparison across activation functions for the NN architecture $100 \times 50$ on the $\mathbf{HIGGS}$ dataset, with the activation functions $\mathrm{ReLU}$, $\mathrm{ReLU}_{\mathrm{res}}$, $\mathrm{SELU}$, $\mathrm{SELU}_{\mathrm{BN}}$, $\mathrm{GELU}$, $\mathrm{LReLU}_{0.005}$, $\mathrm{LReLU}_{0.01}$, $\mathrm{LReLU}_{0.05}$ against (left) $\mathrm{Bern}_{9,0.01}[-5,5]$, $\mathrm{Bern}_{15,0.01}[-5,5]$, (middle) $\mathrm{Bern}_{9,0.005}$, $\mathrm{Bern}_{9,0.01}$, $\mathrm{Bern}_{9,0.05}$, $\mathrm{Bern}_{9,0.01}[-5,5]$, and (right) $\mathrm{Bern}_{15,0.005}$, $\mathrm{Bern}_{15,0.01}$, $\mathrm{Bern}_{15,0.05}$, $\mathrm{Bern}_{15,0.01}[-5,5]$. The y-axis shows, in logarithmic scale, the average dead neuron percentage of each layer over the last epoch; the x-axis is the layer number.
  • Figure 4: The Mean Absolute Gradient (MAG) of the first-layer activations over an epoch vs. epoch number during training. The comparison is across activation functions for the NN architecture $100 \times 50$ on the $\mathbf{HIGGS}$ dataset, with the activation functions $\mathrm{ReLU}$, $\mathrm{ReLU}_{\mathrm{res}}$, $\mathrm{SELU}$, $\mathrm{SELU}_{\mathrm{BN}}$, $\mathrm{GELU}$, $\mathrm{LReLU}_{0.005}$, $\mathrm{LReLU}_{0.01}$, $\mathrm{LReLU}_{0.05}$ against (left) $\mathrm{Bern}_{9,0.01}[-5,5]$, $\mathrm{Bern}_{15,0.01}[-5,5]$, (middle) $\mathrm{Bern}_{9,0.005}$, $\mathrm{Bern}_{9,0.01}$, $\mathrm{Bern}_{9,0.05}$, $\mathrm{Bern}_{9,0.01}[-5,5]$, and (right) $\mathrm{Bern}_{15,0.005}$, $\mathrm{Bern}_{15,0.01}$, $\mathrm{Bern}_{15,0.05}$, $\mathrm{Bern}_{15,0.01}[-5,5]$. The y-axis is in logarithmic scale.
  • Figure 5: (Left) The mean absolute gradient of the first-layer activations over an epoch vs. epoch number during training. (Middle, Right) Dead neuron ratio comparison on MNIST for non-Bernstein activations (Middle) and Bernstein activations (Right).
  • ...and 6 more figures
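
The two per-layer quantities plotted in these figures, the minimum absolute derivative and the dead neuron ratio, can be sketched in a few lines of plain Python. The summary does not give the paper's exact definition of a "dead" neuron, so this sketch assumes one: a neuron counts as dead when the absolute derivative of its activation stays below a small threshold on every sample. The function names, the threshold value, and the toy data are all hypothetical.

```python
def min_abs_derivative(abs_derivs):
    """Smallest |activation derivative| over all samples and neurons of a layer.

    abs_derivs[s][j] is the absolute derivative of neuron j on sample s.
    """
    return min(min(row) for row in abs_derivs)


def dead_neuron_ratio(abs_derivs, threshold=1e-6):
    """Fraction of neurons whose |derivative| is below `threshold` on every sample.

    This definition of "dead" is an assumption made for illustration.
    """
    num_neurons = len(abs_derivs[0])
    dead = sum(
        1
        for j in range(num_neurons)
        if all(row[j] < threshold for row in abs_derivs)
    )
    return dead / num_neurons


# Toy layer with 3 neurons and 2 samples (made-up numbers).
toy = [
    [0.0, 0.5, 0.0],
    [0.0, 0.2, 0.3],
]
ratio = dead_neuron_ratio(toy)        # only neuron 0 is inactive on both samples
smallest = min_abs_derivative(toy)
```

Under this assumed definition, an activation whose derivative is bounded below by a positive constant larger than the threshold can never be counted as dead, which is the link between the derivative bound and the dead neuron plots.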

Theorems & Definitions (16)

  • Proposition 2.1: khedr2024deepbern
  • Theorem 3.1: Bernstein Gradient Lower Bound
  • Definition 4.1: Modulus of Continuity
  • Lemma 4.2: Effective Degree of DeepBernNets
  • proof
  • Theorem 4.5: Exponential Approximation Rate
  • proof
  • Lemma 1.1: Bernstein Derivative Formula
  • proof
  • Lemma 1.2: Bounds on Local Derivative
  • ...and 6 more