What Scalable Second-Order Information Knows for Pruning at Initialization

Ivo Gollini Navarrete; Nicolás Mauricio Cuadrado Ávila; Martin Takáč; Samuel Horváth

TL;DR

This paper addresses pruning at initialization (PaI) using scalable second-order information. It argues that diagonal estimators such as the Hutchinson diagonal and the empirical Fisher diagonal capture the essential curvature directions early in training, enabling effective one-shot pruning with linear time and space complexity. A warmup phase that updates batch normalization (BN) statistics is proposed to mitigate layer collapse and improve data-dependent criteria. Empirical results across CNNs and ViTs on CIFAR-10/100, TinyImageNet, and ImageNet show that Hutchinson-based methods often outperform traditional baselines and substantially narrow the gap between PaI and pruning after training (PaT), offering a practical, scalable approach to neural network pruning.

Abstract

Pruning remains an effective strategy for reducing both the costs and environmental impact associated with deploying large neural networks (NNs) while maintaining performance. Classical methods, such as OBD (LeCun et al., 1989) and OBS (Hassibi et al., 1992), demonstrate that utilizing curvature information can significantly enhance the balance between network complexity and performance. However, the computation and storage of the Hessian matrix make it impractical for modern NNs, motivating the use of approximations. Recent research (Gur et al., 2018; Karakida et al., 2019) suggests that the top eigenvalues guide optimization in a small subspace, are identifiable early, and remain consistent during training. Motivated by these findings, we revisit pruning at initialization (PaI) to evaluate scalable, unbiased second-order approximations, such as the Empirical Fisher and Hutchinson diagonals. Our experiments show that these methods capture sufficient curvature information to improve the identification of critical parameters compared to first-order baselines, while maintaining linear complexity. Additionally, we empirically demonstrate that updating batch normalization statistics as a warmup phase improves the performance of data-dependent criteria and mitigates the issue of layer collapse. Notably, Hutchinson-based criteria consistently outperformed or matched existing PaI algorithms across various models (including VGG, ResNet, and ViT) and datasets (such as CIFAR-10/100, TinyImageNet, and ImageNet). Our findings suggest that scalable second-order approximations strike an effective balance between computational efficiency and accuracy, making them a valuable addition to the pruning toolkit. We make our code available.
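As a rough illustration of the idea in the abstract, the sketch below estimates a Hessian diagonal with Hutchinson's method (averaging z ⊙ (Hz) over random ±1 probe vectors z) and turns it into a one-shot pruning mask. It is a toy under stated assumptions, not the paper's implementation: the explicit matrix `A` stands in for the Hessian-vector product that automatic differentiation would supply for a real network, the function names `hvp`, `hutchinson_diagonal`, and `prune_mask` are hypothetical, and the score used is the OBD-style saliency 0.5 · h_ii · w_i², which may differ from the exact criteria evaluated in the paper.

```python
import random


def hvp(matrix, vec):
    """Hessian-vector product for the toy quadratic f(w) = 0.5 * w^T A w (Hessian = A)."""
    return [sum(a * v for a, v in zip(row, vec)) for row in matrix]


def hutchinson_diagonal(hvp_fn, dim, num_samples, rng):
    """Estimate diag(H) as the average of z * (H z) over Rademacher probes z."""
    estimate = [0.0] * dim
    for _ in range(num_samples):
        z = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        hz = hvp_fn(z)  # one Hessian-vector product per probe, so cost stays linear in dim
        for i in range(dim):
            estimate[i] += z[i] * hz[i] / num_samples
    return estimate


def prune_mask(weights, diag, sparsity):
    """Keep the weights with the largest OBD-style saliency 0.5 * h_ii * w_i^2."""
    scores = [0.5 * h * w * w for h, w in zip(diag, weights)]
    num_keep = len(weights) - int(sparsity * len(weights))
    ranked = sorted(range(len(weights)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:num_keep])
    return [1 if i in keep else 0 for i in range(len(weights))]


# Toy symmetric "Hessian" with small off-diagonal terms (illustrative values only).
A = [
    [4.0, 0.1, 0.0, 0.2],
    [0.1, 0.5, 0.1, 0.0],
    [0.0, 0.1, 3.0, 0.1],
    [0.2, 0.0, 0.1, 0.2],
]
weights = [1.0, 1.0, 1.0, 1.0]

rng = random.Random(0)
diag_est = hutchinson_diagonal(lambda v: hvp(A, v), dim=4, num_samples=2000, rng=rng)
mask = prune_mask(weights, diag_est, sparsity=0.5)

print("estimated diagonal:", [round(d, 2) for d in diag_est])
print("keep mask:", mask)
```

The estimator only ever touches the Hessian through products with a vector, which is why it needs no more memory than the parameter vector itself; the off-diagonal terms contribute zero-mean noise that averages out as the number of probes grows.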
