How Controlling the Variance can Improve Training Stability of Sparsely Activated DNNs and CNNs
Emily Dent, Jared Tanner
TL;DR
This work extends the Edge-of-Chaos Gaussian-process view of deep networks to sparsity-inducing activations by treating the per-layer variance $q^{(\ell)}$ as a tunable fixed point $q^*$. It demonstrates that increasing $q^*$ improves the symmetry of the variance map, tightens finite-dimensional corrections, and reduces the sensitivity of the backpropagation gain $\chi_1(q)$, thereby enhancing training stability at very high sparsities. The authors derive analytical bounds for finite-width corrections and provide experiments with dense DNNs and CNNs using activations like $\text{CReLU}_{\tau,m}$ and $\text{CST}_{\tau,m}$, achieving up to 90% hidden-layer sparsity while maintaining near-full accuracy and faster convergence. These findings offer a principled parameter to reduce energy consumption in sparsity-driven architectures and point toward extensions to more complex models such as transformers. Overall, the paper supplies both theoretical and empirical support for using $q^*$ as a knob to improve stability and efficiency in sparsely activated networks.
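For orientation, the quantities named above can be written in the standard mean-field form used in the EoC literature. This is a sketch under assumptions: the symbols $\sigma_w^2$, $\sigma_b^2$ (weight and bias variances), $\phi$ (the activation), and $V$ (the variance map) are notation introduced here for illustration and may differ from the paper's own.

$$
q^{(\ell+1)} = V\big(q^{(\ell)}\big) = \sigma_w^2\, \mathbb{E}_{z\sim\mathcal{N}(0,1)}\Big[\phi\big(\sqrt{q^{(\ell)}}\,z\big)^2\Big] + \sigma_b^2,
\qquad
\chi_1(q) = \sigma_w^2\, \mathbb{E}_{z\sim\mathcal{N}(0,1)}\Big[\phi'\big(\sqrt{q}\,z\big)^2\Big].
$$

In this notation a fixed point satisfies $V(q^*) = q^*$, and the Edge-of-Chaos condition is $\chi_1(q^*) = 1$. Treating $q^*$ as tunable then amounts to choosing $(\sigma_w^2, \sigma_b^2)$ so that both conditions hold at the desired $q^*$.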
Abstract
The intermediate layers of deep networks can be characterised as a Gaussian process; in particular, the Edge-of-Chaos (EoC) initialisation strategy prescribes the limiting covariance matrix of that Gaussian process. Here we show that the chosen variance of the Gaussian process, a degree of freedom that is often under-utilised, is important in the training of deep networks with sparsity-inducing activations, such as a shifted and clipped ReLU, $\text{CReLU}_{\tau,m}(x)=\min(\max(x-\tau,0),m)$. Specifically, initialisations leading to larger fixed Gaussian process variances allow for improved expressivity with activation sparsity as large as 90% in DNNs and CNNs, and generally improve the stability of the training process. Enabling full, or near-full, accuracy at such high levels of sparsity in the hidden layers suggests a promising mechanism to reduce the energy consumption of machine learning models involving fully connected layers.
