How Controlling the Variance can Improve Training Stability of Sparsely Activated DNNs and CNNs

Emily Dent, Jared Tanner

TL;DR

This work extends the Edge-of-Chaos Gaussian-process view of deep networks to sparsity-inducing activations by treating the per-layer variance $q^{(\ell)}$ as a tunable fixed point $q^*$. It demonstrates that increasing $q^*$ improves the symmetry of the variance map, tightens finite-dimensional corrections, and reduces the sensitivity of the backpropagation gain $\chi_1(q)$, thereby enhancing training stability at very high sparsities. The authors derive analytical bounds for finite-width corrections and provide experiments with dense DNNs and CNNs using activations like $\text{CReLU}_{\tau,m}$ and $\text{CST}_{\tau,m}$, achieving up to 90% hidden-layer sparsity while maintaining near-full accuracy and faster convergence. These findings offer a principled parameter to reduce energy consumption in sparsity-driven architectures and point toward extensions to more complex models such as transformers. Overall, the paper supplies both theoretical and empirical support for using $q^*$ as a knob to improve stability and efficiency in sparsely activated networks.

Abstract

The intermediate layers of deep networks can be characterised as a Gaussian process; in particular, the Edge-of-Chaos (EoC) initialisation strategy prescribes the limiting covariance matrix of the Gaussian process. Here we show that the chosen variance of the Gaussian process, an under-utilised degree of freedom, is important in the training of deep networks with sparsity-inducing activations, such as a shifted and clipped ReLU, $\text{CReLU}_{\tau,m}(x)=\min(\max(x-\tau,0),m)$. Specifically, initialisations leading to larger fixed Gaussian process variances allow for improved expressivity with activation sparsity as large as 90% in DNNs and CNNs, and generally improve the stability of the training process. Enabling full, or near full, accuracy at such high levels of sparsity in the hidden layers suggests a promising mechanism to reduce the energy consumption of machine learning models involving fully connected layers.
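As a rough illustration of the activation defined above, the following sketch implements $\text{CReLU}_{\tau,m}$ and checks the fraction of zero outputs on Gaussian inputs. It assumes pre-activations distributed as $N(0, q)$, in which case the output is zero exactly when $x \le \tau$, so a target sparsity $s$ corresponds to $\tau = \sqrt{q}\,\Phi^{-1}(s)$. The helper name `tau_for_sparsity` and the values of `q`, `s` and `m` are illustrative assumptions, not taken from the paper.

```python
import math
import random
from statistics import NormalDist


def crelu(x: float, tau: float, m: float) -> float:
    """Shifted and clipped ReLU from the abstract: min(max(x - tau, 0), m)."""
    return min(max(x - tau, 0.0), m)


def tau_for_sparsity(s: float, q: float) -> float:
    """Shift tau with P(x <= tau) = s for x ~ N(0, q) (hypothetical helper)."""
    return math.sqrt(q) * NormalDist().inv_cdf(s)


# Illustrative values only; they are not the paper's experimental settings.
q, s, m = 2.0, 0.9, 1.0
tau = tau_for_sparsity(s, q)

# Monte Carlo check: draw Gaussian pre-activations and count zero outputs.
rng = random.Random(0)
samples = [rng.gauss(0.0, math.sqrt(q)) for _ in range(100_000)]
outputs = [crelu(x, tau, m) for x in samples]
empirical_sparsity = sum(y == 0.0 for y in outputs) / len(outputs)
print(f"tau = {tau:.4f}, empirical sparsity = {empirical_sparsity:.4f}")
```

Under this Gaussian assumption, a larger variance $q$ at fixed sparsity $s$ simply rescales $\tau$ by $\sqrt{q}$, while the clipping height $m$ does not affect the fraction of zeros.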

Paper Structure (24 sections, 3 theorems, 44 equations, 27 figures, 3 tables)

Key Result

Theorem 2.1

Where the recursive relations \ref{eq:vmap}, \ref{eq:fourth_mom_recursion}, \ref{eq:nlo_recursion} hold, assuming $0 < V'(q^{*}) < 1$, $q^{\{1\}(1)}=0$ and $r^{(1)}=0$, for $\ell \geq 3$.

Figures (27)

  • Figure 1: Non-linear activations (a) $\text{CReLU}_{\tau, m}$ and (b) $\text{CST}_{\tau, m}$, as defined in \ref{crelu_eq} and \ref{cst_eq} respectively, with $\tau=1$ and $m=1$.
  • Figure 2: $V_{\text{CReLU}_{\tau, m}}(q)$ for $q^*=1$ and $s=0.85$, with $m \in \{1.2, 1.4, 1.6, 1.8, 2.0\}$; the vertical axis is $V(q)$ and the horizontal axis is $q$. As $m$ increases, $V'_{\text{CReLU}_{\tau, m}}(q) \rightarrow 1$ and $V''_{\text{CReLU}_{\tau, m}}(q)$ exceeds zero, resulting in a second fixed point at $q \approx 3.5$ for $m=2.0$.
  • Figure 3: $V'_{\text{CReLU}_{\tau, m}}(q^*)$ for sparsities $s \in \{0.6, 0.7, 0.8, 0.85, 0.9, 0.95\}$, plotted from left to right then top to bottom, with horizontal axis $q^*$ and vertical axis $m$. For fixed $m$, increasing $q^*$ reduces $V'_{\text{CReLU}_{\tau, m}}(q^*)$; this holds for all six sparsity levels.
  • Figure 4: $V''_{\text{CReLU}_{\tau, m}}(q^*)$ for sparsities $s \in \{0.6, 0.7, 0.8, 0.85, 0.9, 0.95\}$, plotted from left to right then top to bottom, with horizontal axis $q^*$ and vertical axis $m$. For high sparsity levels $s \in \{0.85, 0.9, 0.95\}$ and fixed $m$, increasing $q^*$ reduces $V''_{\text{CReLU}_{\tau, m}}(q^*)$. For lower sparsity levels $s \in \{0.6, 0.7, 0.8\}$ with larger values of $m$, increasing $q^*$ again reduces $V''_{\text{CReLU}_{\tau, m}}(q^*)$.
  • Figure 5: Logarithm of the upper bound on the absolute value of the next-leading-order term of the layer variance, $\left|\tilde{q}^{\{1\}(\ell)}\right|$, plotted against $q^*$, as in \ref{abs_upper_bound_nlo}, for sparsities $s \in \{0.85, 0.9, 0.95\}$ from left to right and fixed $m=2$. Increasing $q^*$ reduces the upper bound on $\left|\tilde{q}^{\{1\}(\ell)}\right|$ exponentially.
  • ...and 22 more figures
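The behaviour described in the captions of Figures 2 and 3 can be explored numerically. The sketch below is not the authors' code. It assumes the standard mean-field definitions, which this page does not spell out: a variance map $V(q) = \sigma_w^2\,\mathbb{E}[\phi(\sqrt{q}z)^2] + \sigma_b^2$ with $z \sim N(0,1)$, and an Edge-of-Chaos condition $\chi_1(q^*) = \sigma_w^2\,\mathbb{E}[\phi'(\sqrt{q^*}z)^2] = 1$ together with $V(q^*) = q^*$. The symbols $\sigma_w^2$, $\sigma_b^2$ and all function names are assumptions introduced for illustration.

```python
import math
from statistics import NormalDist

N01 = NormalDist()  # standard normal distribution


def second_moment(q, tau, m, n=2000):
    """E[CReLU(sqrt(q) z)^2], z ~ N(0, 1): Simpson's rule on the ramp plus plateau."""
    sq = math.sqrt(q)
    a, b = tau / sq, (tau + m) / sq  # ramp region in units of z
    h = (b - a) / n                  # n must be even for Simpson's rule

    def g(z):
        return (sq * z - tau) ** 2 * N01.pdf(z)

    total = g(a) + g(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * g(a + i * h)
    ramp = total * h / 3
    plateau = m * m * (1.0 - N01.cdf(b))  # output is clipped at m beyond b
    return ramp + plateau


def eoc_init(q_star, s, m):
    """Pick tau, sigma_w^2, sigma_b^2 so that chi_1(q*) = 1 and V(q*) = q* (assumed setup)."""
    sq = math.sqrt(q_star)
    tau = sq * N01.inv_cdf(s)                 # sparsity s under N(0, q*)
    slope_mass = N01.cdf((tau + m) / sq) - s  # E[phi'^2]: phi' = 1 on the ramp only
    sigma_w2 = 1.0 / slope_mass
    sigma_b2 = q_star - sigma_w2 * second_moment(q_star, tau, m)
    return tau, sigma_w2, sigma_b2


def variance_map(q, tau, m, sigma_w2, sigma_b2):
    """V(q) under the assumed mean-field recursion."""
    return sigma_w2 * second_moment(q, tau, m) + sigma_b2


def variance_map_slope(q, tau, m, sigma_w2, sigma_b2, dq=1e-4):
    """Central finite-difference estimate of V'(q)."""
    hi = variance_map(q + dq, tau, m, sigma_w2, sigma_b2)
    lo = variance_map(q - dq, tau, m, sigma_w2, sigma_b2)
    return (hi - lo) / (2.0 * dq)


# Illustrative settings: s and m are taken from the Figure 2 caption, q* is varied.
s, m = 0.85, 1.2
results = {}
for q_star in (1.0, 2.0):
    tau, sw2, sb2 = eoc_init(q_star, s, m)
    v = variance_map(q_star, tau, m, sw2, sb2)
    slope = variance_map_slope(q_star, tau, m, sw2, sb2)
    results[q_star] = (v, slope, sb2)
    print(f"q*={q_star}: tau={tau:.3f} sigma_w^2={sw2:.3f} "
          f"sigma_b^2={sb2:.3f} V(q*)={v:.3f} V'(q*)={slope:.3f}")
```

Under these assumptions the slope $V'(q^*)$ stays below 1 and is smaller at $q^*=2$ than at $q^*=1$ for the same $m$, which is in line with the trend stated in the Figure 3 caption. The integral is split at the kinks of the activation so that Simpson's rule is applied only to a smooth integrand.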

Theorems & Definitions (6)

  • Theorem 2.1
  • Lemma D.1
  • proof
  • Lemma D.2
  • proof
  • proof: Proof of \ref{thm:finite_dim_nlo}