Find A Winning Sign: Sign Is All We Need to Win the Lottery

Junghun Oh, Sungyong Baik, Kyoung Mu Lee

TL;DR

The paper addresses the challenge of finding winning tickets that generalize from random initializations by highlighting the critical role of parameter signs and normalization layers. It introduces AWS, a sign-based variant of learning rate rewinding that interpolates normalization parameters during training to keep the sparse subnet in the basin of attraction, yielding a transferable signed mask $s_T^{AWS}$. Empirical results across CIFAR-100, Tiny-ImageNet, and ImageNet with various architectures show that random networks masked with $s_T^{AWS}$ can reach performance close to dense networks after training, and exhibit strong SGD stability and linear mode connectivity to the AWS solution. This work advances toward true LTH by enabling generalization transfer from a signed mask to arbitrary random initializations, with implications for robust, initialization-agnostic pruning and transfer learning.
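As a rough illustration of what "masking a random network with a signed mask" could look like, the sketch below builds a mask with entries in {-1, 0, +1} from a trained sparse weight tensor and applies it to a fresh random initialization. It assumes that applying the mask means keeping the random magnitudes while imposing the mask's sparsity pattern and signs; the function names `signed_mask` and `apply_signed_mask`, the magnitude-based pruning rule, and the toy tensors are all hypothetical and are not taken from the authors' code.

```python
import numpy as np

def signed_mask(trained_weights, keep_mask):
    """Build a mask in {-1, 0, +1}: sign of each surviving weight, 0 where pruned."""
    return np.sign(trained_weights) * keep_mask

def apply_signed_mask(init_weights, s):
    """Keep the random magnitudes, but impose the mask's sparsity and signs.

    Assumption: this is one plausible reading of 'masking with a signed mask'.
    """
    return np.abs(init_weights) * s

rng = np.random.default_rng(0)

# Stand-in for a trained layer; a real pipeline would take this from the pruned network.
trained = rng.normal(size=(4, 4))

# Toy pruning rule (assumption): keep the larger-magnitude half of the weights.
keep = (np.abs(trained) > np.median(np.abs(trained))).astype(trained.dtype)

s = signed_mask(trained, keep)

# A fresh random initialization inherits only the sparsity and signs.
fresh = rng.normal(size=(4, 4))
masked = apply_signed_mask(fresh, s)
```

After this step, `masked` shares no magnitudes with `trained`; only the zero pattern and the signs carry over, which is the information the summary says is transferred.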

Abstract

The Lottery Ticket Hypothesis (LTH) posits the existence of a sparse subnetwork (a.k.a. winning ticket) that can generalize comparably to its over-parameterized counterpart when trained from scratch. The common approach to finding a winning ticket is to preserve the original strong generalization through Iterative Pruning (IP) and transfer information useful for achieving the learned generalization by applying the resulting sparse mask to an untrained network. However, existing IP methods still struggle to generalize their observations beyond ad-hoc initialization and small-scale architectures or datasets, or they bypass these challenges by applying their mask to trained weights instead of initialized ones. In this paper, we demonstrate that the parameter sign configuration plays a crucial role in conveying useful information for generalization to any randomly initialized network. Through linear mode connectivity analysis, we observe that a sparse network trained by an existing IP method can retain its basin of attraction if its parameter signs and normalization layer parameters are preserved. To take a step closer to finding a winning ticket, we alleviate the reliance on normalization layer parameters by preventing high error barriers along the linear path between the sparse network trained by our method and its counterpart with initialized normalization layer parameters. Interestingly, across various architectures and datasets, we observe that any randomly initialized network can be optimized to exhibit low error barriers along the linear path to the sparse network trained by our method by inheriting its sparsity and parameter sign information, potentially achieving performance comparable to the original. The code is available at https://github.com/JungHunOh/AWS_ICLR2025.git
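The "error barrier along the linear path" mentioned above can be made concrete with a small sketch. It uses one common definition from the linear mode connectivity literature (the highest error on the interpolation path minus the mean of the two endpoint errors); whether the paper uses exactly this variant is an assumption. The function `error_barrier`, the number of interpolation points, and the toy one-dimensional error functions are all hypothetical stand-ins for evaluating a real network on held-out data.

```python
import numpy as np

def error_barrier(theta_a, theta_b, error_fn, num_points=11):
    """Highest error on the straight line between two solutions,
    measured relative to the mean error of the two endpoints."""
    alphas = np.linspace(0.0, 1.0, num_points)
    path_errors = np.array(
        [error_fn((1.0 - a) * theta_a + a * theta_b) for a in alphas]
    )
    endpoint_mean = 0.5 * (path_errors[0] + path_errors[-1])
    return float(path_errors.max() - endpoint_mean)

# Toy 'error' with two separate minima at -1 and +1 (a double well).
double_well = lambda t: float(np.sum((t ** 2 - 1.0) ** 2))

# Toy convex 'error' with a single basin.
bowl = lambda t: float(np.sum(t ** 2))

# Endpoints in different wells: the path crosses a bump, so the barrier is high.
high = error_barrier(np.array([-1.0]), np.array([1.0]), double_well)

# Endpoints in the same convex basin: no point on the path is worse than the ends.
low = error_barrier(np.array([1.0, 0.0]), np.array([0.0, 1.0]), bowl)
```

In this toy setting `high` is positive and `low` is zero, mirroring the distinction between solutions that are and are not linearly mode-connected.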

Paper Structure

This paper contains 14 sections, 2 equations, 6 figures, 2 tables, 1 algorithm.

Figures (6)

  • Figure 1: Illustration of our motivation and method. ${\boldsymbol{\psi}}$ and ${\boldsymbol{\phi}}$ denote network parameters of normalization layers and parameters excluding those of normalization layers, respectively. The 'LMC region' refers to a region of solutions that are linearly mode-connected to the LRR or AWS solution.
  • Figure 2: Motivational experiments on CIFAR-100. We investigate the effect of parameter initialization in the LRR subnetwork while preserving their signs with respect to (a) test accuracy, (b) SGD-noise stability, and (c) linear mode connectivity with the LRR subnetwork. In (b) and (c), we use a pruned network with a remaining parameter ratio of approximately 0.09. We show the mean (each point) and standard deviation (shaded area) across 3 trials.
  • Figure 3: Main results on CIFAR-100 and Tiny-ImageNet. (a): Test accuracy of the LRR solution (blue), the AWS solution (green), a randomly initialized network trained with the LRR-driven signed mask (orange), and a randomly initialized network trained with the AWS-driven signed mask (red). (b) and (c): Analysis of SGD noise stability and linear mode connectivity, respectively. A randomly initialized network trained with the AWS-driven signed mask exhibits high SGD noise stability and low error barriers along the linear path to the AWS solution (green), in contrast to the case of LRR (orange). In (b) and (c), we use a pruned network with a remaining parameter ratio of approximately 0.09. We report the mean (each point) and standard deviation (shaded area) across 3 trials.
  • Figure 4: Comparison to GM on CIFAR-100 with ResNet-32. The results of GM are approximated from Sreenivasan et al. (2022).
  • Figure 5: Effect of transferring the sign of normalization layer parameters. ${\boldsymbol{\psi}}_\text{init}^*$ denotes the initialized normalization layer parameters whose scaling and bias factors are set to 1 and 0.1, respectively. We conduct the experiments on CIFAR-100.
  • ...and 1 more figure