Sparse maximal update parameterization: A holistic approach to sparse training dynamics

Nolan Dey, Shane Bergsma, Joel Hestness

Abstract

Several challenges make it difficult for sparse neural networks to compete with dense models. First, setting a large fraction of weights to zero impairs forward and gradient signal propagation. Second, sparse studies often need to test multiple sparsity levels, while also introducing new hyperparameters (HPs), leading to prohibitive tuning costs. Indeed, the standard practice is to re-use the learning HPs originally crafted for dense models. Unfortunately, we show sparse and dense networks do not share the same optimal HPs. Without stable dynamics and effective training recipes, it is costly to test sparsity at scale, which is key to surpassing dense networks and making the business case for sparsity acceleration in hardware. A holistic approach is needed to tackle these challenges and we propose S$μ$Par as one such approach. For random unstructured static sparsity, S$μ$Par ensures activations, gradients, and weight updates all scale independently of sparsity level. Further, by reparameterizing the HPs, S$μ$Par enables the same HP values to be optimal as we vary both sparsity level and model width. HPs can be tuned on small dense networks and transferred to large sparse models, greatly reducing tuning costs. On large-scale language modeling, S$μ$Par shows increasing improvements over standard parameterization as sparsity increases, leading up to 11.9% relative loss improvement at 99.2% sparsity. A minimal implementation of S$μ$Par is available at https://github.com/EleutherAI/nanoGPT-mup/tree/supar.
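
The first challenge named in the abstract, weakened forward signal propagation under sparsity, can be seen in a single linear layer. The sketch below is illustrative only and uses the Python standard library. It applies a random, static, unstructured mask to Gaussian weights and measures the standard deviation of the outputs. The helper name `masked_layer_output_std` and the `1 / sqrt(density)` rescaling of the initialization are assumptions made for this example: a generic variance-preserving correction, not the full parameterization from the paper, which also reparameterizes learning HPs and is not spelled out in this section.

```python
import math
import random
import statistics


def masked_layer_output_std(fan_in, fan_out, density, weight_std, seed=0):
    """Std of y = (W * M) x for a unit-variance input x and a random mask M.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(fan_in)]
    outputs = []
    for _ in range(fan_out):
        acc = 0.0
        for xi in x:
            # Random unstructured sparsity: keep each weight with
            # probability `density`, otherwise treat it as zero.
            if rng.random() < density:
                acc += rng.gauss(0.0, weight_std) * xi
        outputs.append(acc)
    return statistics.pstdev(outputs)


if __name__ == "__main__":
    fan_in, fan_out = 512, 256
    base_std = 1.0 / math.sqrt(fan_in)  # dense init giving roughly unit-scale outputs
    for density in (1.0, 0.25, 0.0625):
        # Dense init reused unchanged: output scale shrinks like sqrt(density).
        naive = masked_layer_output_std(fan_in, fan_out, density, base_std)
        # Assumed correction: divide the weight variance by the density.
        rescaled = masked_layer_output_std(
            fan_in, fan_out, density, base_std / math.sqrt(density)
        )
        print(f"density={density}: naive std={naive:.3f}, rescaled std={rescaled:.3f}")
```

With the dense initialization reused unchanged, the output scale should fall roughly as the square root of the density, while the rescaled variant should stay near one. This covers only the forward pass at initialization; keeping gradients and weight updates independent of sparsity during training requires the additional adjustments the abstract refers to.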

Paper Structure

This paper contains 38 sections, 21 equations, 15 figures, 3 tables.

Figures (15)

  • Figure 1: S$μ$Par (our work) allows stable optimum HPs for any sparsity level, unlike standard practice.
  • Figure 2: S$μ$Par enables sparse training at scale, helping to surpass dense and motivate sparsity in hardware.
  • Figure 3: For LLMs, S$μ$Par forms the Pareto frontier loss across sparsity levels, with no HP tuning required.
  • Figure 4: The three operations associated with training a layer with weights that perform the function $\mathcal{F}$: Forward activation calculation, backward gradient propagation, and the weight update.
  • Figure 5: Mean absolute value activations for attention and feed forward blocks after training step $t$ (10 seeds). In SP and $μ$P models, decreasing density causes activations to vanish (note axes on log-scale). In S$μ$Par models, density has little effect on activation scales and there is no vanishing.
  • ...and 10 more figures