On the Selection of Initialization and Activation Function for Deep Neural Networks
Soufiane Hayou, Arnaud Doucet, Judith Rousseau
TL;DR
This work analyzes how the choice of initialization and activation function affects information propagation in deep neural networks, using Gaussian-process approximations in the infinite-width limit. It formalizes the edge of chaos (EOC) as the critical regime in which the variance and correlation dynamics allow signals to propagate deeper, and shows that at the EOC ReLU-like activations lose input information at a polynomial, rather than exponential, rate in depth. The authors identify a broader class of activations, including Swish, that satisfies sufficient conditions for the correlation dynamics to stay close to the identity, which improves depth scalability. Empirical results on MNIST corroborate the theory: Swish combined with EOC initialization yields faster learning and higher accuracy for deep architectures. The findings give theoretical grounding for activation design and practical guidance for initialization, and are also relevant to Bayesian neural networks.
Abstract
The weight initialization and the activation function of deep neural networks have a crucial impact on the performance of the training procedure. An inappropriate selection can lead to the loss of information about the input during forward propagation and to exponentially vanishing or exploding gradients during back-propagation. Understanding the theoretical properties of untrained random networks is key to identifying which deep networks may be trained successfully, as recently demonstrated by Schoenholz et al. (2017), who showed that for deep feedforward neural networks only a specific choice of hyperparameters, known as the 'edge of chaos', can lead to good performance. We complete this analysis by providing quantitative results showing that, for a class of ReLU-like activation functions, information indeed propagates deeper for an initialization at the edge of chaos. By further extending this analysis, we identify a class of activation functions that improve information propagation over ReLU-like functions. This class includes the Swish activation, $\phi_{\text{swish}}(x) = x \cdot \text{sigmoid}(x)$, used in Hendrycks & Gimpel (2016), Elfwing et al. (2017) and Ramachandran et al. (2017). This provides a theoretical grounding for the excellent empirical performance of $\phi_{\text{swish}}$ observed in these contributions. We complement those previous results by illustrating the benefit of using a random initialization on the edge of chaos in this context.
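As a small illustration of the Swish definition given in the abstract, the following sketch evaluates x · sigmoid(x) next to ReLU in plain Python. The function names `sigmoid`, `swish`, and `relu` are illustrative choices rather than code from the paper, and the numerically stable form of the sigmoid is an implementation assumption.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |x| (assumed helper)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x < 0, so exp(x) cannot overflow
    return z / (1.0 + z)


def swish(x: float) -> float:
    """Swish activation as defined in the abstract: x * sigmoid(x)."""
    return x * sigmoid(x)


def relu(x: float) -> float:
    """ReLU activation, included only for comparison."""
    return max(0.0, x)


if __name__ == "__main__":
    # Print both activations on a few sample inputs.
    for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
        print(f"x={x:+.1f}  swish={swish(x):+.4f}  relu={relu(x):+.4f}")
```

Unlike ReLU, Swish is smooth and takes small negative values for negative inputs, while approaching the identity for large positive inputs.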
