
On the Selection of Initialization and Activation Function for Deep Neural Networks

Soufiane Hayou, Arnaud Doucet, Judith Rousseau

TL;DR

This work analyzes how initialization and activation choice influence information propagation in deep neural networks, using Gaussian-process approximations in the infinite-width limit. It formalizes the edge of chaos (EOC) as the critical regime where the variance and correlation dynamics permit deeper signal transmission, and shows that ReLU-like activations exhibit polynomial, rather than exponential, information retention at the EOC. The authors identify a broader activation class, including Swish, that satisfies sufficient conditions to keep the correlation dynamics close to the identity, thereby improving depth scalability. Empirical results on MNIST corroborate the theory: Swish and EOC initialization yield faster learning and higher accuracy for deep architectures. The findings provide theoretical grounding for activation design and guidance for initialization strategies in practice, and are also relevant to Bayesian neural networks.

Abstract

The weight initialization and the activation function of deep neural networks have a crucial impact on the performance of the training procedure. An inappropriate selection can lead to the loss of information about the input during forward propagation and to exponentially vanishing or exploding gradients during back-propagation. Understanding the theoretical properties of untrained random networks is key to identifying which deep networks may be trained successfully, as recently demonstrated by Schoenholz et al. (2017), who showed that for deep feedforward neural networks only a specific choice of hyperparameters known as the 'edge of chaos' can lead to good performance. We complete this analysis by providing quantitative results showing that, for a class of ReLU-like activation functions, information indeed propagates deeper for an initialization at the edge of chaos. By further extending this analysis, we identify a class of activation functions that improve the information propagation over ReLU-like functions. This class includes the Swish activation, $\phi_{\text{swish}}(x) = x \cdot \text{sigmoid}(x)$, used in Hendrycks & Gimpel (2016), Elfwing et al. (2017) and Ramachandran et al. (2017). This provides a theoretical grounding for the excellent empirical performance of $\phi_{\text{swish}}$ observed in these contributions. We complement those previous results by illustrating the benefit of using a random initialization on the edge of chaos in this context.
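To make the abstract's claims concrete, the following is a minimal sketch of how the pre-activation variance evolves with depth in the infinite-width limit. It assumes the standard mean-field recursion $q^{l} = \sigma_b^2 + \sigma_w^2 \, \mathbb{E}[\phi(\sqrt{q^{l-1}} Z)^2]$ with $Z \sim \mathcal{N}(0, 1)$, which this excerpt does not write out. The function names and the choice of Gauss-Hermite quadrature are illustrative assumptions, not the authors' code.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Gauss-Hermite nodes/weights for the weight exp(-x^2 / 2); dividing by
# sqrt(2*pi) turns the weighted sum into an expectation over Z ~ N(0, 1).
_NODES, _WEIGHTS = hermegauss(101)
_WEIGHTS = _WEIGHTS / np.sqrt(2.0 * np.pi)


def relu(x):
    return np.maximum(x, 0.0)


def swish(x):
    # phi_swish(x) = x * sigmoid(x), as defined in the abstract.
    return x / (1.0 + np.exp(-x))


def variance_map(q, phi, sigma_b2, sigma_w2):
    # One layer of the assumed recursion:
    # q -> sigma_b^2 + sigma_w^2 * E[phi(sqrt(q) * Z)^2].
    second_moment = np.sum(_WEIGHTS * phi(np.sqrt(q) * _NODES) ** 2)
    return sigma_b2 + sigma_w2 * second_moment


def propagate_variance(q0, phi, sigma_b2, sigma_w2, depth):
    # Returns [q^0, q^1, ..., q^depth] for the chosen activation and init.
    history = [q0]
    for _ in range(depth):
        history.append(variance_map(history[-1], phi, sigma_b2, sigma_w2))
    return history


# ReLU at (sigma_b^2, sigma_w^2) = (0, 2) keeps the variance fixed, whereas
# sigma_w^2 = 1 halves it at every layer.
eoc = propagate_variance(1.5, relu, 0.0, 2.0, depth=50)
off_eoc = propagate_variance(1.5, relu, 0.0, 1.0, depth=50)
```

For ReLU, $\mathbb{E}[\phi(\sqrt{q} Z)^2] = q/2$, so the pair $(\sigma_b^2, \sigma_w^2) = (0, 2)$ leaves the variance unchanged from layer to layer. Passing `swish` as `phi` gives the same recursion for Swish, though locating its edge of chaos requires the paper's conditions, which are not reproduced here.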

Paper Structure

This paper contains 19 sections, 18 theorems, 33 equations, 11 figures, 3 tables.

Key Result

Proposition 1

Let $M_{\phi} := \sup_{x \geq 0} \mathbb{E}[|\phi'^2(x Z) + \phi''(x Z) \phi(x Z)|]$. Assume $M_{\phi} < \infty$, then for $\sigma_w^2 < \frac{1}{M_{\phi}}$ and any $\sigma_b$, we have $(\sigma_b, \sigma_w) \in D_{\phi, var}$ and $K_{\phi, var}(\sigma_b, \sigma_w) = \infty$. Let $C_{\phi, \delt
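As an illustration of the constant $M_{\phi}$ in the first sentence of Proposition 1, here is a hedged numerical sketch for $\phi = \tanh$. It reads $\phi'^2(xZ)$ as $(\phi'(xZ))^2$, takes $Z \sim \mathcal{N}(0, 1)$, approximates the expectation by Gauss-Hermite quadrature, and replaces the supremum by a maximum over a finite grid, so the result is only a lower estimate. The function name and the grid are assumptions made for this example.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Quadrature rule for expectations over Z ~ N(0, 1).
nodes, weights = hermegauss(101)
weights = weights / np.sqrt(2.0 * np.pi)


def m_phi_tanh(x_grid):
    # Grid approximation of sup_{x >= 0} E[|phi'(xZ)^2 + phi''(xZ) phi(xZ)|].
    values = []
    for x in x_grid:
        t = np.tanh(x * nodes)
        # For tanh: phi' = 1 - t^2 and phi'' = -2 t (1 - t^2), hence
        # phi'^2 + phi'' * phi = (1 - t^2) * (1 - 3 t^2).
        integrand = np.abs((1.0 - t**2) * (1.0 - 3.0 * t**2))
        values.append(np.sum(weights * integrand))
    return max(values)


# Hypothetical grid on [0, 5]; it includes x = 0.
m_tanh = m_phi_tanh(np.linspace(0.0, 5.0, 201))
```

For tanh the integrand $|(1 - t^2)(1 - 3t^2)|$ never exceeds 1 and equals 1 only at $t = 0$, so the supremum is attained at $x = 0$ and $M_{\tanh} = 1$. The variance condition of the proposition then reads $\sigma_w^2 < 1$.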

Figures (11)

  • Figure 1: Two draws of outputs for ReLU and Tanh networks with $(\sigma_b, \sigma_w)=(1, 1) \in D_{\phi, var} \cap D_{\phi, corr}$. The output functions are almost constant.
  • Figure 2: A draw from the output function of a ReLU network with 20 layers, 100 neurons per layer, $(\sigma_b^2, \sigma_w^2) = (0, 2)$ (edge of chaos).
  • Figure 3: Impact of the initialization on the EOC for a ReLU network
  • Figure 4: Correlation function and a draw of the output for a Swish network
  • Figure 5: Impact of the initialization on the edge of chaos for a Swish network
  • ...and 6 more figures
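The setting described in the caption of Figure 2 can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes weights drawn as $W_{ij} \sim \mathcal{N}(0, \sigma_w^2 / N_{l-1})$ and biases as $\mathcal{N}(0, \sigma_b^2)$, reads "20 layers" as 20 hidden layers, and adds a scalar linear readout. The one-dimensional input grid, the seed, and the function name are hypothetical.

```python
import numpy as np


def draw_relu_network_output(x, depth=20, width=100,
                             sigma_b2=0.0, sigma_w2=2.0, seed=0):
    # x has shape (n_points, input_dim); returns one scalar output per point.
    rng = np.random.default_rng(seed)
    h = np.asarray(x, dtype=float)
    fan_in = h.shape[1]
    for _ in range(depth):
        # Assumed scaling: weight variance sigma_w^2 / fan_in.
        W = rng.normal(0.0, np.sqrt(sigma_w2 / fan_in), size=(fan_in, width))
        b = rng.normal(0.0, np.sqrt(sigma_b2), size=width)
        h = np.maximum(h @ W + b, 0.0)  # ReLU
        fan_in = width
    # Scalar linear readout (an assumption made for plotting a single curve).
    w_out = rng.normal(0.0, np.sqrt(sigma_w2 / fan_in), size=(fan_in, 1))
    return (h @ w_out).ravel()


# Hypothetical one-dimensional input grid.
inputs = np.linspace(-1.0, 1.0, 50).reshape(-1, 1)
outputs = draw_relu_network_output(inputs)
```

Changing `sigma_b2` and `sigma_w2` to `(1.0, 1.0)` gives the ordered-phase setting named in the caption of Figure 1, where the caption reports almost constant output functions.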

Theorems & Definitions (30)

  • Definition 1
  • Proposition 1
  • Definition 2
  • Lemma 1
  • Proposition 2
  • Proposition 3: ReLU kernel
  • Proposition 4: Main Result
  • Lemma 2
  • Proposition 5
  • Proposition 6
  • ...and 20 more
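The polynomial information retention for ReLU mentioned in the TL;DR can be checked numerically. The sketch below assumes the well-known closed form of the ReLU correlation map at $(\sigma_b^2, \sigma_w^2) = (0, 2)$, namely the degree-1 arc-cosine kernel $f(c) = \frac{1}{\pi}\left(c \arcsin c + \sqrt{1 - c^2}\right) + \frac{c}{2}$. The exact statement of Proposition 3 is not shown in this excerpt, so treat that formula and the function names as assumptions.

```python
import math


def relu_eoc_correlation_map(c):
    # Assumed closed form of the layer-to-layer correlation map for ReLU
    # with sigma_b = 0 and sigma_w^2 = 2.
    c = max(-1.0, min(1.0, c))  # guard against rounding outside [-1, 1]
    return (c * math.asin(c) + math.sqrt(1.0 - c * c)) / math.pi + 0.5 * c


def correlation_gap(c0, depth):
    # Returns [1 - c^1, ..., 1 - c^depth] starting from correlation c0.
    c = c0
    gaps = []
    for _ in range(depth):
        c = relu_eoc_correlation_map(c)
        gaps.append(1.0 - c)
    return gaps


gaps = correlation_gap(0.1, depth=100)
# Polynomial decay: doubling the depth shrinks the gap by a modest constant
# factor instead of squaring it, as geometric decay would.
ratio = gaps[99] / gaps[49]
```

Near $c = 1$ this map satisfies $1 - f(c) \approx (1 - c) - \frac{2\sqrt{2}}{3\pi}(1 - c)^{3/2}$, so the gap $1 - c^l$ shrinks on the order of $1/l^2$. Two inputs therefore remain distinguishable much deeper into the network than under exponential convergence.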