Uncovering a Winning Lottery Ticket with Continuously Relaxed Bernoulli Gates

Itamar Tsayag, Ofir Lindenbaum

TL;DR

Experiments across fully connected networks, CNNs, and Vision Transformers demonstrate up to 90% sparsity with minimal accuracy loss, nearly double the sparsity that edge-popup achieves at comparable accuracy, establishing a scalable framework for pre-training network sparsification.

Abstract

Over-parameterized neural networks incur prohibitive memory and computational costs for resource-constrained deployment. The Strong Lottery Ticket (SLT) hypothesis suggests that randomly initialized networks contain sparse subnetworks that achieve competitive accuracy without any weight training. Existing SLT methods, notably edge-popup, rely on non-differentiable score-based selection, which limits optimization efficiency and scalability. We propose using continuously relaxed Bernoulli gates to discover SLTs through fully differentiable, end-to-end optimization: only the gating parameters are trained, while all network weights stay frozen at their initialized values. Continuous relaxation enables direct gradient-based optimization of an $\ell_0$-regularization objective, eliminating the need for non-differentiable gradient estimators or iterative pruning cycles. To our knowledge, this is the first fully differentiable approach for SLT discovery that avoids straight-through estimator approximations. Experiments across fully connected networks, CNNs (ResNet, Wide-ResNet), and Vision Transformers (ViT, Swin-T) demonstrate up to 90% sparsity with minimal accuracy loss, nearly double the sparsity that edge-popup achieves at comparable accuracy, establishing a scalable framework for pre-training network sparsification.
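To make the gating idea concrete, here is a minimal sketch of a single linear layer whose frozen random weights are masked by relaxed Bernoulli gates. The abstract does not say which relaxation is used, so the sketch assumes a Gaussian-based gate, z = clip(mu + sigma * eps, 0, 1), for which the expected number of active gates has the closed form sum of Phi(mu / sigma). The names `sample_gate`, `prob_active`, `prob_active_grad`, and `gated_linear`, the value of `SIGMA`, and the layer sizes are all hypothetical, and the task loss is left out to keep the example short.

```python
import math
import random

SIGMA = 0.5  # assumed noise scale of the relaxation (illustrative value)

def sample_gate(mu, rng, sigma=SIGMA):
    """Relaxed Bernoulli gate: Gaussian-perturbed mean, clipped to [0, 1]."""
    return min(1.0, max(0.0, mu + sigma * rng.gauss(0.0, 1.0)))

def prob_active(mu, sigma=SIGMA):
    """P(gate > 0) = Phi(mu / sigma): smooth per-gate surrogate for the L0 indicator."""
    return 0.5 * (1.0 + math.erf(mu / (sigma * math.sqrt(2.0))))

def prob_active_grad(mu, sigma=SIGMA):
    """Derivative of prob_active w.r.t. mu: Gaussian density at mu / sigma, divided by sigma."""
    return math.exp(-0.5 * (mu / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def gated_linear(x, weight, mus, rng=None):
    """Compute (W * z) x with frozen W; z is sampled if rng is given, else clip(mu, 0, 1)."""
    out = []
    for w_row, mu_row in zip(weight, mus):
        acc = 0.0
        for xi, w, mu in zip(x, w_row, mu_row):
            z = sample_gate(mu, rng) if rng is not None else min(1.0, max(0.0, mu))
            acc += w * z * xi
        out.append(acc)
    return out

rng = random.Random(0)
weight = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]  # frozen random init
mus = [[0.5] * 8 for _ in range(4)]  # gate parameters: the only trainable values
x = [rng.gauss(0.0, 1.0) for _ in range(8)]

y = gated_linear(x, weight, mus, rng)  # stochastic forward pass
penalty_before = sum(prob_active(m) for row in mus for m in row)

# One gradient step on the sparsity term alone (task loss omitted in this sketch).
lr = 0.1
mus = [[m - lr * prob_active_grad(m) for m in row] for row in mus]
penalty_after = sum(prob_active(m) for row in mus for m in row)
```

Because the penalty is a smooth function of each `mu`, it can be minimized jointly with a task loss by ordinary gradient descent, and `weight` is never updated. A gate whose `mu` is driven well below zero clips to exactly zero at inference, which prunes the corresponding weight.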

Paper Structure (24 sections, 5 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Pre-training sparsification on LeNet-300-100. Blue region: percentage of pruned weights. Green region: retained weights. Red line: test accuracy on MNIST as sparsification progresses. The method achieves 96% accuracy at 45% sparsification.
  • Figure 2: Per-layer sparsification of ResNet50 on CIFAR-10. Later layers exhibit higher sparsification rates, consistent with prior findings that early layers require more weights for low-level feature extraction.
  • Figure 3: Robustness to base network size (LeNet on MNIST). Blue: pruned weights. Green: retained weights. Red: test accuracy. SLTs can be discovered even in base networks at 20% of the original size.