Sign-In to the Lottery: Reparameterizing Sparse Training From Scratch

Advait Gadhikar, Tom Jacobs, Chao Zhou, Rebekka Burkholz

TL;DR

The paper addresses the challenge of training sparse neural networks from scratch (PaI) and identifies parameter signs as a key missing piece compared to dense-to-sparse methods. It introduces Sign-In, a reparameterization $\theta \mapsto m \odot w$ with an inner scaling that provably induces sign flips and promotes sign alignment, improving PaI performance across masks and architectures. The authors provide theoretical support (via Riemannian gradient flow) and empirical support, including an impossibility result showing that no reparameterization can replace overparameterization, and sign recovery results in simple settings. While Sign-In improves PaI and is orthogonal to dense pretraining, it does not fully close the gap to dense-to-sparse training, which remains the main open challenge for training sparse networks from scratch. Overall, sign alignment emerges as a sufficient condition for sparse trainability, and Sign-In offers a practical mechanism to realize it with modest overhead and broad applicability.

Abstract

The performance gap between training sparse neural networks from scratch (PaI) and dense-to-sparse training presents a major roadblock for efficient deep learning. According to the Lottery Ticket Hypothesis, PaI hinges on finding a problem-specific parameter initialization. As we show, determining the correct parameter signs is sufficient to this end. Yet, they remain elusive to PaI. To address this issue, we propose Sign-In, which employs a dynamic reparameterization that provably induces sign flips. Such sign flips are complementary to the ones that dense-to-sparse training can accomplish, rendering Sign-In an orthogonal method. While our experiments and theory suggest performance improvements of PaI, they also carve out the main open challenge to close the gap between PaI and dense-to-sparse training.
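
To make the reparameterization idea concrete, here is a minimal sketch, not the authors' implementation. It trains a single scalar $\theta = m \cdot w$ by gradient descent on a toy quadratic loss and checks whether the sign of $\theta$ changes. The loss, the initial values, the learning rate, and the function name `train_product_param` are all assumptions chosen for illustration, and the inner scaling that Sign-In adds is omitted. In this convex toy problem, plain gradient descent on $\theta$ would also reach the target, so the sketch only shows how the two factors receive gradients.

```python
import numpy as np

def train_product_param(m0, w0, target, lr=0.05, steps=500):
    """Gradient descent on L(theta) = 0.5 * (theta - target)^2 with theta = m * w.

    Illustrative toy only: the loss and hyperparameters are assumptions,
    and Sign-In's inner scaling is not modeled here.
    """
    m, w = float(m0), float(w0)
    history = []
    for _ in range(steps):
        theta = m * w
        history.append(theta)
        g = theta - target  # dL/dtheta
        # Chain rule: dL/dm = g * w and dL/dw = g * m (simultaneous update).
        m, w = m - lr * g * w, w - lr * g * m
    return m, w, np.array(history)

# Start with a positive product and a negative target, so theta must flip sign.
m, w, hist = train_product_param(m0=1.0, w0=0.5, target=-1.0)
theta = m * w
sign_flipped = np.sign(hist[0]) != np.sign(theta)
print(f"theta: {hist[0]:.3f} -> {theta:.3f}, sign flipped: {sign_flipped}")
```

In this run the sign flip happens because the factor `w` crosses zero while `m` stays positive. That is one way a product parameterization can change the sign of $\theta$.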

Paper Structure

This paper contains 26 sections, 4 theorems, 14 equations, 8 figures, 16 tables, 1 algorithm.

Key Result

Theorem 5.1

(Theorem 2.1 of gadhikar2024masks) In the one-dimensional single-neuron setting, Eq. (theory:gf) with $d=1$, the student can recover the ground truth given sufficiently many samples if $a_{in} > 0$ and $w_{1,in} > 0$. In all other cases ($a_{in} > 0$, $w_{1,in} < 0$), ($a_{in} < 0$, $w_{1,in} > 0$) an

Figures (8)

  • Figure 1: Sign learning for a $90\%$ sparse ResNet50 on ImageNet. For dense-to-sparse methods, the majority of signs are flipped early, up to the end of warmup, and signs stabilize after warmup. In contrast, signs do not stabilize after warmup for sparse training from scratch, for different masks.
  • Figure 2: Sign-In recovers the solution in a different case than overparameterization for a two-layer network by sign flipping. Combining the two solves all cases (empirically).
  • Figure 3: Student-teacher setup. (a) Training one neuron for a single input ($d=1$): when $w_1 \leq 0$, both methods fail to reach the ground truth. If $w_1 > 0$, the Sign-In gradient flow succeeds, whereas standard gradient flow fails in this case when additionally $a < 0$. (b) Representation of a two-layer student-teacher neural network with multiple neurons, inspired by chizat2020lazy: sparse training from scratch fails with bad signs. (c) Sign-In enables learning the representation with bad signs.
  • Figure 4: Sign flips with Sign-In. Training a random, $90\%$ sparse ResNet50 on ImageNet with Sign-In (a) induces more sign flips during training and (b) finds a flatter minimum.
  • Figure 5: Representation of a two-layer student-teacher neural network inspired by chizat2020lazy. Signs are crucial in learning sparse representations. A dense neural network can easily learn the representation in Fig. \ref{fig:dense}. Sparse networks can learn when signs are correct (Fig. \ref{fig:good sparse}) and fail with bad signs (Fig. \ref{fig:bad sparse}). The reparameterization enables learning the representation with bad signs in Fig. \ref{fig:mw}.
  • ...and 3 more figures
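
Several captions above refer to how many signs flip during training. As a rough illustration, a sign-flip statistic of that kind could be computed as follows. This is a sketch under assumptions: the helper name `sign_flip_fraction`, the convention that a mask entry of 1 marks a kept weight, and the example arrays are all hypothetical, and the paper's exact metric may differ.

```python
import numpy as np

def sign_flip_fraction(theta_init, theta_now, mask):
    """Fraction of kept weights (mask == 1) whose sign differs from initialization.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    active = np.asarray(mask).astype(bool)
    if not active.any():
        return 0.0  # no kept weights, so nothing can flip
    flipped = np.sign(theta_init[active]) != np.sign(theta_now[active])
    return float(flipped.mean())

# Made-up example values: four weights, the third one is pruned.
theta_init = np.array([0.5, -0.2, 0.1, -0.7])
theta_now = np.array([-0.4, -0.3, 0.2, 0.6])
mask = np.array([1, 1, 0, 1])

fraction = sign_flip_fraction(theta_init, theta_now, mask)
print(f"fraction of flipped signs among kept weights: {fraction:.3f}")
```

Tracking this quantity against the initialization at several checkpoints would give a curve of the kind the captions describe, with pruned weights excluded from the count.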

Theorems & Definitions (5)

  • Theorem 5.1
  • Theorem 5.2: Sign-In improves PaI
  • Remark 5.3
  • Theorem 5.4: No reparameterization replaces overparameterization
  • Theorem A.1