
When Layers Play the Lottery, all Tickets Win at Initialization

Artur Jordao, George Correa de Araujo, Helena de Almeida Maia, Helio Pedrini

TL;DR

The paper investigates pruning at initialization through layer removal within the Lottery Ticket Hypothesis framework. It shows that winning tickets can arise when entire layers are pruned and, crucially, can be discovered before training using data-driven criteria such as GraSP, circumventing the need to train a dense network first. Empirically, layer-based winning tickets match or exceed dense-network accuracy across several settings, yield up to 2× training speedups and substantial CO2-emission reductions (up to 51%), and enhance robustness to adversarial and out-of-distribution inputs, outperforming standard filter-pruning tickets in initialization scenarios. This work introduces a new direction for LTH that emphasizes layer-level pruning to achieve greener, faster, and more robust deep learning models in residual architectures.

Abstract

Pruning is a standard technique for reducing the computational cost of deep networks. Many advances in pruning leverage concepts from the Lottery Ticket Hypothesis (LTH). LTH reveals that inside a trained dense network there exist sparse subnetworks (tickets) able to achieve similar accuracy (i.e., to win the lottery; these are the winning tickets). Pruning at initialization focuses on finding winning tickets without training a dense network. Studies on these concepts share the trend that subnetworks come from weight or filter pruning. In this work, we investigate LTH and pruning at initialization through the lens of layer pruning. First, we confirm the existence of winning tickets when the pruning process removes layers. Building on this observation, we propose to discover these winning tickets at initialization, eliminating the need for heavy computational resources to train the initial (over-parameterized) dense network. Extensive experiments show that our winning tickets notably speed up the training phase and reduce carbon emissions by up to 51%, an important step towards democratization and green Artificial Intelligence. Beyond computational benefits, our winning tickets exhibit robustness against adversarial and out-of-distribution examples. Finally, we show that our subnetworks easily win the lottery at initialization, while tickets from filter removal (the standard structured LTH) hardly become winning tickets.

Paper Structure

This paper contains 8 sections, 2 equations, 4 figures, 5 tables, 2 algorithms.

Figures (4)

  • Figure 1: Lottery Ticket Hypothesis (LTH) views according to the structure (weights, neurons/filters, or layers) that the pruning eliminates (transparent regions). Top-left. Original unstructured LTH: the pruning removes weights and yields unstructured tickets; therefore, the tickets only provide practical benefits on specialized frameworks for sparse computations. Top-right. Structured LTH: the pruning eliminates neurons/filters. In this setting, the tickets are structured and provide computational advantages on standard deep learning frameworks. Bottom-left. Our structured LTH: the pruning eliminates entire layers, encouraging additional performance gains since it decreases the sequential processing (latency). Bottom-right. The highest gain (the higher, the better) obtained by a winning ticket relative to its dense counterpart. Our winning tickets successfully emerge at initialization, which means we can discover efficient subnetworks without training a dense network. In this direction, we can considerably speed up the learning phase by replacing a dense network with its sparse version before training begins. Our winning tickets also exhibit robustness against adversarial attacks.
  • Figure 2: Overall process to remove layers (residual models) from a residual network. After identifying a victim layer (dashed rectangle), we create a new network (bottom) without it. Finally, we transfer the weights (red arrows) of the kept layers from the original unpruned network (top) to the new network.
  • Figure 3: $\ell_1$-norm score of layers of ResNet32. Layers within a stage operate on the same input/output spatial resolution (i.e., the size of the feature map -- values in parentheses).
  • Figure 4: Architecture of a residual-like network. The rationale behind this architecture is that the output of a layer combines the transformation it performs ($f$) with ($\oplus$) the input ($y$) it receives. Because of this design, when we disable layer $i$ (its transformation -- dashed lines), the output (representation) of layer $i-1$ is propagated to layer $i+1$, which means that the output $y_i$ equals $y_{i-1}$. For the sake of simplicity, we omit the batch normalization and activation layers.
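
The captions of Figures 2 to 4 describe three mechanics: scoring layers by their $\ell_1$-norm, removing a victim layer while transferring the kept weights, and the residual identity $y_i = y_{i-1}$ when layer $i$ is disabled. The following is a minimal NumPy sketch of those mechanics, not the authors' implementation. Everything named here is hypothetical: the toy transformation `f` (one weight matrix followed by ReLU), the helpers `make_layer`, `forward`, `l1_scores`, and `remove_layer`, and in particular the rule of picking the lowest-scoring layer as the victim, which is an assumption made for illustration only. All layers share one width, mimicking layers within a single stage.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_layer(dim):
    """Toy residual layer: a single weight matrix stands in for f."""
    return {"W": rng.normal(scale=0.1, size=(dim, dim))}

def forward(layers, x, disabled=()):
    """Residual forward pass: y_i = f_i(y_{i-1}) + y_{i-1}.

    A disabled layer skips its transformation, so y_i = y_{i-1}.
    """
    y = x
    for i, layer in enumerate(layers):
        if i in disabled:
            continue  # identity: the previous output is propagated unchanged
        y = np.maximum(y @ layer["W"], 0.0) + y
    return y

def l1_scores(layers):
    """l1-norm of each layer's weights (one score per layer)."""
    return [float(np.abs(layer["W"]).sum()) for layer in layers]

def remove_layer(layers, victim):
    """Create a new network without the victim and copy the kept weights."""
    return [{"W": layer["W"].copy()}
            for i, layer in enumerate(layers) if i != victim]

dense = [make_layer(8) for _ in range(4)]   # unpruned network at initialization
x = rng.normal(size=(2, 8))                 # a small batch of inputs

# Assumption for illustration: the layer with the lowest score is the victim.
victim = int(np.argmin(l1_scores(dense)))
pruned = remove_layer(dense, victim)

# Disabling the victim in the dense network matches the physically pruned one.
assert np.allclose(forward(dense, x, disabled={victim}), forward(pruned, x))
```

The final assertion checks the point made in the Figure 4 caption: because of the skip connection, disabling a layer and physically removing it yield the same representation, which is why a shallower network can be built and trained directly.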