How Controlling the Variance can Improve Training Stability of Sparsely Activated DNNs and CNNs

Emily Dent; Jared Tanner

TL;DR

This work extends the Edge-of-Chaos Gaussian-process view of deep networks to sparsity-inducing activations by treating the per-layer variance $q^{(\ell)}$ as a tunable fixed point $q^*$. It demonstrates that increasing $q^*$ improves the symmetry of the variance map, tightens finite-dimensional corrections, and reduces the sensitivity of the backpropagation gain $\chi_1(q)$, thereby enhancing training stability at very high sparsities. The authors derive analytical bounds for finite-width corrections and provide experiments with dense DNNs and CNNs using activations like $\text{CReLU}_{\tau,m}$ and $\text{CST}_{\tau,m}$, achieving up to 90% hidden-layer sparsity while maintaining near-full accuracy and faster convergence. These findings offer a principled parameter to reduce energy consumption in sparsity-driven architectures and point toward extensions to more complex models such as transformers. Overall, the paper supplies both theoretical and empirical support for using $q^*$ as a knob to improve stability and efficiency in sparsely activated networks.

Abstract

The intermediate layers of deep networks can be characterised as a Gaussian process; in particular, the Edge-of-Chaos (EoC) initialisation strategy prescribes the limiting covariance matrix of the Gaussian process. Here we show that the under-utilised choice of the Gaussian process variance is important in the training of deep networks with sparsity-inducing activations, such as a shifted and clipped ReLU, $\text{CReLU}_{\tau,m}(x)=\min(\max(x-\tau,0),m)$. Specifically, initialisations leading to larger fixed Gaussian process variances allow for improved expressivity with activation sparsity as large as 90% in DNNs and CNNs, and generally improve the stability of the training process. Enabling full, or near full, accuracy at such high levels of sparsity in the hidden layers suggests a promising mechanism to reduce the energy consumption of machine learning models involving fully connected layers.
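
As a minimal sketch of the activation defined in the abstract, the snippet below implements $\text{CReLU}_{\tau,m}$ and measures how many outputs are exactly zero. It assumes the pre-activations are zero-mean Gaussian with variance $q$, in which case the fraction of zeros is $\Phi(\tau/\sqrt{q})$, since the activation vanishes exactly when $x \le \tau$. The helper names (`crelu`, `expected_sparsity`, `tau_for_sparsity`) are illustrative and not taken from the paper; the values $\tau=1$, $m=1$ match those quoted for Figure 1.

```python
import math
from statistics import NormalDist

import numpy as np


def crelu(x, tau, m):
    """Shifted and clipped ReLU: min(max(x - tau, 0), m)."""
    return np.minimum(np.maximum(x - tau, 0.0), m)


def expected_sparsity(tau, q):
    """P(x <= tau) for x ~ N(0, q): the fraction of exact zeros after CReLU."""
    return NormalDist().cdf(tau / math.sqrt(q))


def tau_for_sparsity(s, q):
    """Shift tau that gives target sparsity s when x ~ N(0, q) (assumption)."""
    return math.sqrt(q) * NormalDist().inv_cdf(s)


# Monte Carlo check under the Gaussian pre-activation assumption.
rng = np.random.default_rng(0)
q, tau, m = 1.0, 1.0, 1.0
x = rng.normal(0.0, math.sqrt(q), size=200_000)
empirical = float(np.mean(crelu(x, tau, m) == 0.0))

print(f"empirical sparsity: {empirical:.4f}")
print(f"predicted sparsity: {expected_sparsity(tau, q):.4f}")
print(f"tau for s=0.9 at q=4: {tau_for_sparsity(0.9, 4.0):.4f}")
```

Under this assumption, holding the sparsity $s$ fixed while increasing the variance means scaling $\tau$ with $\sqrt{q}$, which is why `tau_for_sparsity` takes both arguments.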

Paper Structure (24 sections, 3 theorems, 44 equations, 27 figures, 3 tables)

Introduction
Related Work
Edge-of-Chaos
Main Contributions
Concentrating $q$
Finite Dimensional Correction
Sensitivity of $\chi_1(q)$
Summary of Findings
Experiments
Conclusion and Further Extensions
Further on $\chi_1$
Derivation of Further Derivatives for CReLU and CST
Variance Map Analysis for CST
Finite Dimensional Correction
Further DNN Experiment Results
...and 9 more sections

Key Result

Theorem 2.1

Where the recursive relations \ref{eq:vmap}, \ref{eq:fourth_mom_recursion} and \ref{eq:nlo_recursion} hold, assuming $0 < V'\left(q^{*}\right) < 1$, $q^{\{1\}(1)}=0$ and $r^{(1)}=0$, for $\ell \geq 3$.
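
The assumption $0 < V'(q^*) < 1$ can be checked numerically for a given activation. The sketch below does this for $\text{CReLU}_{\tau,m}$ under the standard mean-field form of the variance map, $V(q) = \sigma_w^2\,\mathbb{E}[\phi(\sqrt{q}z)^2] + \sigma_b^2$ with $z \sim N(0,1)$, and the usual EoC conditions $V(q^*) = q^*$ and $\chi_1(q^*) = \sigma_w^2\,\mathbb{E}[\phi'(\sqrt{q^*}z)^2] = 1$. These forms are assumptions drawn from the general EoC literature, not quoted from this excerpt, and every function name here is hypothetical.

```python
import math

import numpy as np


def crelu(x, tau, m):
    """Shifted and clipped ReLU: min(max(x - tau, 0), m)."""
    return np.minimum(np.maximum(x - tau, 0.0), m)


def gauss_expect(f, q, n=200_001, width=10.0):
    """E[f(x)] for x ~ N(0, q), via the trapezoid rule on a uniform grid."""
    sd = math.sqrt(q)
    x = np.linspace(-width * sd, width * sd, n)
    pdf = np.exp(-x**2 / (2.0 * q)) / math.sqrt(2.0 * math.pi * q)
    y = f(x) * pdf
    dx = x[1] - x[0]
    return float(dx * (y.sum() - 0.5 * (y[0] + y[-1])))


def eoc_init(tau, m, q_star):
    """(sigma_w^2, sigma_b^2) satisfying chi_1(q*) = 1 and V(q*) = q* (assumed EoC form)."""
    # CReLU has slope 1 on (tau, tau + m) and slope 0 elsewhere.
    slope_sq = gauss_expect(
        lambda x: ((x > tau) & (x < tau + m)).astype(float), q_star
    )
    sigma_w2 = 1.0 / slope_sq
    second_moment = gauss_expect(lambda x: crelu(x, tau, m) ** 2, q_star)
    sigma_b2 = q_star - sigma_w2 * second_moment
    return sigma_w2, sigma_b2


def variance_map(q, tau, m, sigma_w2, sigma_b2):
    """V(q) = sigma_w^2 * E[phi(x)^2] + sigma_b^2 with x ~ N(0, q)."""
    return sigma_w2 * gauss_expect(lambda x: crelu(x, tau, m) ** 2, q) + sigma_b2


def variance_map_slope(q_star, tau, m, h=1e-3):
    """Central finite-difference estimate of V'(q*)."""
    sw2, sb2 = eoc_init(tau, m, q_star)
    up = variance_map(q_star + h, tau, m, sw2, sb2)
    down = variance_map(q_star - h, tau, m, sw2, sb2)
    return (up - down) / (2.0 * h)


slope = variance_map_slope(q_star=1.0, tau=1.0, m=1.0)
print(f"V'(q*) = {slope:.4f}, condition 0 < V'(q*) < 1 holds: {0.0 < slope < 1.0}")
```

Note that `eoc_init` can return a negative `sigma_b2` for some choices of $(\tau, m, q^*)$; such a triple would not be a valid initialisation under these assumptions, so it is worth checking the sign before using the result.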

Figures (27)

Figure 1: Non-linear activations (a) $\text{CReLU}_{\tau, m}$ and (b) $\text{CST}_{\tau, m}$, as defined in \ref{crelu_eq} and \ref{cst_eq} respectively, with $\tau=1$ and $m=1$.
Figure 2: $V_{\text{CReLU}_{\tau, m}}(q)$ for $q^*=1$ and $s=0.85$, with $m=1.2, 1.4, 1.6, 1.8, 2.0$; the vertical axis is $V(q)$ and the horizontal axis is $q$. As $m$ increases, $V'_{\text{CReLU}_{\tau, m}}(q) \rightarrow 1$ and $V''_{\text{CReLU}_{\tau, m}}(q)$ exceeds zero, resulting in a second fixed point at $q \approx 3.5$ for $m=2.0$.
Figure 3: $V'_{\text{CReLU}_{\tau, m}}(q^*)$ for sparsities $s = \{0.6, 0.7, 0.8, 0.85, 0.9, 0.95\}$, plotted from left to right then top to bottom, with horizontal axis $q^*$ and vertical axis $m$. For fixed $m$, increasing $q^*$ reduces $V'_{\text{CReLU}_{\tau, m}}(q^*)$; this holds at all six sparsity levels.
Figure 4: $V''_{\text{CReLU}_{\tau, m}}(q^*)$ for sparsities $s = \{0.6, 0.7, 0.8, 0.85, 0.9, 0.95\}$, plotted from left to right then top to bottom, with horizontal axis $q^*$ and vertical axis $m$. For the high sparsity levels $s = \{0.85, 0.9, 0.95\}$ and fixed $m$, increasing $q^*$ reduces $V''_{\text{CReLU}_{\tau, m}}(q^*)$. For the lower sparsity levels $s = \{0.6, 0.7, 0.8\}$ with larger values of $m$, increasing $q^*$ again reduces $V''_{\text{CReLU}_{\tau, m}}(q^*)$.
Figure 5: Log of the upper bound on the absolute value of the next-leading-order term of the layer variance, $\left|\tilde{q}^{\{1\}(\ell)}\right|$, plotted against $q^*$, as in \ref{abs_upper_bound_nlo}, for sparsities $s = \{0.85, 0.9, 0.95\}$ from left to right and fixed $m=2$. Increasing $q^*$ reduces the upper bound on $\left|\tilde{q}^{\{1\}(\ell)}\right|$ exponentially.
...and 22 more figures

Theorems & Definitions (6)

Theorem 2.1
Lemma D.1
proof
Lemma D.2
proof
proof: Proof of \ref{thm:finite_dim_nlo}
