Sparser, Better, Deeper, Stronger: Improving Sparse Training with Exact Orthogonal Initialization

Aleksandra Irena Nowak, Łukasz Gniecki, Filip Szatkowski, Jacek Tabor

TL;DR

The paper tackles static sparse training by addressing the joint design of sparse masks and weight initialization, highlighting that exact spectral properties of the Jacobian critically influence trainability. It introduces Exact Orthogonal Initialization (EOI), which builds sparse, exactly orthogonal weight matrices via composing random Givens rotations to achieve target per-layer densities $d^l$, enabling dynamical isometry without residual or normalization tricks. Empirical results show EOI maintains favorable singular-value spectra, enables training of extremely deep vanilla networks, and delivers consistent performance gains over approximated isometry and other SST baselines across MNIST, CIFAR, and ImageNet-scale tasks; EOI also offers substantial computational efficiency due to the $O(n)$ cost of Givens rotations. The work demonstrates practical impact by improving training stability and accuracy in a variety of architectures (ResNets, VGG, EfficientNet, DeiT) and highlights the need to consider both mask and weight initialization in sparse training, with code available at the project repository.

Abstract

Static sparse training aims to train sparse models from scratch and has achieved remarkable results in recent years. A key design choice is the sparse initialization, which determines the trainable sub-network through a binary mask. Existing methods mainly select such a mask based on a predefined dense initialization. Such an approach may not efficiently leverage the mask's potential impact on the optimization. An alternative direction, inspired by research into dynamical isometry, is to introduce orthogonality in the sparse subnetwork, which helps stabilize the gradient signal. In this work, we propose Exact Orthogonal Initialization (EOI), a novel sparse orthogonal initialization scheme based on composing random Givens rotations. Contrary to other existing approaches, our method provides exact (not approximated) orthogonality and enables the creation of layers with arbitrary densities. We demonstrate the superior effectiveness and efficiency of EOI through experiments, consistently outperforming common sparse initialization techniques. Our method enables training highly sparse 1000-layer MLP and CNN networks without residual connections or normalization techniques, emphasizing the crucial role of weight initialization in static sparse training alongside sparse mask selection. The code is available at https://github.com/woocash2/sparser-better-deeper-stronger
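
The core idea of composing random Givens rotations can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the repository above for that). The function name `random_givens_product`, the uniform sampling of index pairs and angles, and the stopping rule on the fraction of non-zero entries are all assumptions made for illustration. Each rotation touches only two columns, which is where the $O(n)$ per-rotation cost comes from.

```python
import numpy as np

def random_givens_product(n, target_density, seed=0, max_rotations=100_000):
    """Illustrative sketch: compose random Givens rotations, starting from
    the identity, until the fraction of non-zero entries reaches
    target_density. Sampling scheme and stopping rule are assumptions."""
    rng = np.random.default_rng(seed)
    A = np.eye(n)  # the identity is orthogonal and has density 1/n
    for _ in range(max_rotations):
        # Recounting non-zeros is O(n^2); done here only for clarity.
        if np.count_nonzero(A) / A.size >= target_density:
            break
        # Pick two distinct columns and a rotation angle at random.
        i, j = rng.choice(n, size=2, replace=False)
        theta = rng.uniform(0.0, 2.0 * np.pi)
        c, s = np.cos(theta), np.sin(theta)
        # Right-multiplying by a Givens rotation mixes columns i and j only,
        # so the update costs O(n) and preserves orthogonality.
        col_i, col_j = A[:, i].copy(), A[:, j].copy()
        A[:, i] = c * col_i - s * col_j
        A[:, j] = s * col_i + c * col_j
    return A

A = random_givens_product(64, target_density=0.1)
density = np.count_nonzero(A) / A.size
orthogonality_error = np.abs(A.T @ A - np.eye(64)).max()
print(f"density={density:.3f}, max |A^T A - I| = {orthogonality_error:.2e}")
```

The result is orthogonal up to floating-point round-off, and its density slightly overshoots the target because a single rotation can fill in several entries at once.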

Paper Structure (33 sections, 2 theorems, 12 equations, 12 figures, 6 tables, 1 algorithm)

Key Result

Theorem 1

Define $p(t, k)$ as the probability that the top row of $A^{(t)}$ will have exactly $k$ non-zero elements. The following recurrence relation holds: with the base condition $p(0, 1) = 1$ and $p(0, k) = 0$ for $k \neq 1$.
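
The distribution $p(t, \cdot)$ can also be estimated empirically. The sketch below is a Monte Carlo simulation, not the recurrence from the theorem, and it rests on assumptions not stated in this summary: $A^{(0)}$ is the identity, each step right-multiplies by a Givens rotation on a uniformly chosen pair of distinct columns, and a generic rotation angle never creates an exact zero. Under those assumptions only the support of the top row needs to be tracked. The function name `estimate_top_row_distribution` is hypothetical.

```python
import numpy as np

def estimate_top_row_distribution(n, t, trials=2_000, seed=0):
    """Monte Carlo estimate of p(t, k) for k = 0..n, under the assumptions
    listed in the accompanying text (illustrative only)."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(n + 1)
    for _ in range(trials):
        # Support of the top row of A^(0) = I: a single non-zero entry.
        support = np.zeros(n, dtype=bool)
        support[0] = True
        for _ in range(t):
            i, j = rng.choice(n, size=2, replace=False)
            # Mixing columns i and j makes both top-row entries non-zero
            # whenever at least one of them was non-zero before.
            if support[i] or support[j]:
                support[i] = support[j] = True
        counts[support.sum()] += 1
    return counts / trials

p = estimate_top_row_distribution(n=16, t=10)
print({k: round(float(v), 3) for k, v in enumerate(p) if v > 0})
```

With `t=0` the estimate puts all mass on `k = 1`, matching the base condition $p(0, 1) = 1$.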

Figures (12)

  • Figure 1: Sparse orthogonal matrix generation via composition of Givens rotations.
  • Figure 2: The mean (top row) and maximum (middle row) singular values of the input-output Jacobian of an MLP network computed for varying sparsity. In addition, we also present the training loss curve (bottom row) for sparsity 0.95. The colors indicate the activation function used, while the line and marker styles represent the initialization schemes. In the loss curve plots, for clarity of presentation, we show only the ReLU and linear activations. See the appendix for other activations.
  • Figure 3: Top-1 and Top-5 validation accuracy on ImageNet obtained for the ERK and ERK-EOI initializations with density $0.1$ on the ResNet50 (Left) and DeiT III (Right) models.
  • Figure 4: Expected density (red) of a sparse matrix produced by Algorithm 1 as a function of the number of applied Givens rotations. The blue curve represents the empirical evaluation.
  • Figure 5: Training loss curve for sparsity 0.95 for Tanh and Hard Tanh activation functions.
  • ...and 7 more figures

Theorems & Definitions (4)

  • Theorem 1
  • proof
  • Theorem 2
  • proof