What Scalable Second-Order Information Knows for Pruning at Initialization

Ivo Gollini Navarrete, Nicolás Mauricio Cuadrado Ávila, Martin Takáč, Samuel Horváth

TL;DR

This paper addresses pruning at initialization (PaI) by leveraging scalable second-order information. It argues that diagonal estimators like the Hutchinson diagonal and the empirical Fisher diagonal capture essential curvature directions early in training, enabling effective one-shot pruning with linear time/space complexity. A BN statistics warmup is proposed to mitigate layer collapse and improve data-dependent criteria. Empirical results across CNNs and ViTs on CIFAR, TinyImageNet, and ImageNet show Hutchinson-based methods often outperform traditional baselines and substantially close the gap between PaI and PaT, establishing a practical, scalable approach to neural network pruning.

Abstract

Pruning remains an effective strategy for reducing both the costs and environmental impact associated with deploying large neural networks (NNs) while maintaining performance. Classical methods, such as OBD (LeCun et al., 1989) and OBS (Hassibi et al., 1992), demonstrate that utilizing curvature information can significantly enhance the balance between network complexity and performance. However, the computation and storage of the Hessian matrix make it impractical for modern NNs, motivating the use of approximations. Recent research (Gur-Ari et al., 2018; Karakida et al., 2019) suggests that the top eigenvalues guide optimization in a small subspace, are identifiable early, and remain consistent during training. Motivated by these findings, we revisit pruning at initialization (PaI) to evaluate scalable, unbiased second-order approximations, such as the Empirical Fisher and Hutchinson diagonals. Our experiments show that these methods capture sufficient curvature information to improve the identification of critical parameters compared to first-order baselines, while maintaining linear complexity. Additionally, we empirically demonstrate that updating batch normalization statistics as a warmup phase improves the performance of data-dependent criteria and mitigates the issue of layer collapse. Notably, Hutchinson-based criteria consistently outperformed or matched existing PaI algorithms across various models (including VGG, ResNet, and ViT) and datasets (such as CIFAR-10/100, TinyImageNet, and ImageNet). Our findings suggest that scalable second-order approximations strike an effective balance between computational efficiency and accuracy, making them a valuable addition to the pruning toolkit. We make our code available.
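
To make the two diagonal approximations named in the abstract concrete, the following is a minimal NumPy sketch, not the authors' released code. It assumes a toy symmetric matrix `H` as a stand-in for the Hessian, and an OBD-style saliency `0.5 * h_ii * w_i**2` as the pruning score; the paper's exact criterion may differ. The helper names `hutchinson_diagonal`, `empirical_fisher_diagonal`, and `prune_mask` are hypothetical. The Hutchinson estimate only needs Hessian-vector products, which is what keeps the cost linear in the number of parameters.

```python
import numpy as np


def hutchinson_diagonal(hvp, dim, num_samples, rng):
    """Estimate diag(H) as the mean of z * (H z) over Rademacher probes z.

    `hvp` is any function returning the Hessian-vector product H @ v, so the
    full Hessian never has to be formed or stored.
    """
    estimate = np.zeros(dim)
    for _ in range(num_samples):
        z = rng.choice([-1.0, 1.0], size=dim)  # Rademacher probe vector
        estimate += z * hvp(z)
    return estimate / num_samples


def empirical_fisher_diagonal(per_sample_grads):
    """Diagonal of the empirical Fisher: mean of squared per-sample gradients.

    `per_sample_grads` has shape (num_samples, num_params).
    """
    return np.mean(per_sample_grads ** 2, axis=0)


def prune_mask(scores, sparsity):
    """Boolean mask keeping the highest-scoring (1 - sparsity) fraction."""
    num_keep = int(round((1.0 - sparsity) * scores.size))
    keep = np.argsort(scores)[::-1][:num_keep]
    mask = np.zeros(scores.size, dtype=bool)
    mask[keep] = True
    return mask


# Toy setup (assumption): a random symmetric PSD matrix plays the Hessian.
rng = np.random.default_rng(0)
dim = 8
A = rng.normal(size=(dim, dim))
H = A @ A.T
w = rng.normal(size=dim)  # stand-in for the weights at initialization

diag_est = hutchinson_diagonal(lambda v: H @ v, dim, num_samples=2000, rng=rng)

# OBD-style saliency (assumption): larger score means more important weight.
scores = 0.5 * diag_est * w ** 2
mask = prune_mask(scores, sparsity=0.5)
```

In a real network, `hvp` would be computed with automatic differentiation rather than an explicit matrix, and the mask would be applied once before training, in line with the one-shot setting described above.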

Paper Structure

This paper contains 27 sections, 22 equations, 16 figures, 21 tables.

Figures (16)

  • Figure 1: Relationship between curvature and parameter displacement. Each point corresponds to one eigendirection of the Hessian calculated at initialization.
  • Figure 2: Test accuracy across sparsity levels under various PaI methods for our base reference case of CIFAR-10 with ResNet18 (left) and the more complex ImageNet-1K with ResNet50 (right). The dashed gray line denotes the baseline accuracy without pruning.
  • Figure 3: (a) Effect of a warmup phase on training stability in CIFAR-10 with VGG19 at extreme sparsity ratios. Several methods exhibit systematic layer collapse without a warmup (left), leading to near-random performance. A warmup phase (right) largely prevents collapse and preserves accuracy across methods. (b) Importance of batch normalization statistics (BNS) update to prevent layer collapse. We studied the case of VGG19 pruning at $95\%$ sparsity.
  • Figure 4: Structured pruning of ViT-B/16 on CIFAR-10: test accuracy across sparsity levels for all importance-based metrics.
  • Figure : (a) Projection of $\Delta w$ after the first optimization step.
  • ...and 11 more figures