Preserving Deep Representations In One-Shot Pruning: A Hessian-Free Second-Order Optimization Framework

Ryan Lucas, Rahul Mazumder

TL;DR

SNOWS, a one-shot post-training pruning framework aimed at reducing the cost of vision network inference without retraining, uses Hessian-free optimization to compute exact Newton descent steps without needing to compute or store the full Hessian matrix.

Abstract

We present SNOWS, a one-shot post-training pruning framework aimed at reducing the cost of vision network inference without retraining. Current leading one-shot pruning methods minimize layer-wise least-squares reconstruction error, which does not take into account deeper network representations. We propose to optimize a more global reconstruction objective. This objective accounts for nonlinear activations deep in the network to obtain a better proxy for the network loss. This nonlinear objective leads to a more challenging optimization problem, but we demonstrate that it can be solved efficiently using a specialized second-order optimization framework. A key innovation of our framework is the use of Hessian-free optimization to compute exact Newton descent steps without needing to compute or store the full Hessian matrix. A distinct advantage of SNOWS is that it can be readily applied on top of any sparse mask derived from prior methods, readjusting their weights to exploit nonlinearities in deep feature representations. SNOWS obtains state-of-the-art results on various one-shot pruning benchmarks, including residual networks and Vision Transformers (ViT/B-16 and ViT/L-16, with 86M and 304M parameters respectively).
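
To make the idea of a Hessian-free Newton step concrete, here is a minimal sketch; it is not the authors' implementation. It solves H d = -g with conjugate gradient, touching the Hessian only through Hessian-vector products, so the matrix is never formed or stored. As a simplifying assumption, each product is approximated by a central difference of gradients; an automatic-differentiation framework could compute the same product exactly. The names `hvp`, `newton_step_cg`, and `toy_grad`, and the 2x2 quadratic objective, are hypothetical and chosen only for illustration.

```python
from typing import Callable, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def axpy(alpha: float, x: Vector, y: Vector) -> Vector:
    """Return alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def hvp(grad_fn: Callable[[Vector], Vector], w: Vector, v: Vector,
        eps: float = 1e-5) -> Vector:
    """Approximate H(w) @ v by a central difference of gradients.

    Assumption for this sketch: finite differences stand in for an exact
    autodiff Hessian-vector product. The Hessian itself is never built.
    """
    g_plus = grad_fn(axpy(eps, v, w))
    g_minus = grad_fn(axpy(-eps, v, w))
    return [(p - m) / (2.0 * eps) for p, m in zip(g_plus, g_minus)]


def newton_step_cg(grad_fn: Callable[[Vector], Vector], w: Vector,
                   max_iters: int = 50, tol: float = 1e-10) -> Vector:
    """Solve H d = -g by conjugate gradient using only Hessian-vector products."""
    g = grad_fn(w)
    d = [0.0] * len(w)
    r = [-gi for gi in g]          # residual of H d = -g at d = 0
    p = list(r)                    # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iters):
        if rs_old < tol:           # squared residual small enough
            break
        Hp = hvp(grad_fn, w, p)
        curvature = dot(p, Hp)
        if curvature <= 0.0:       # stop on non-positive curvature
            break
        alpha = rs_old / curvature
        d = axpy(alpha, p, d)      # d <- d + alpha * p
        r = axpy(-alpha, Hp, r)    # r <- r - alpha * H p
        rs_new = dot(r, r)
        p = axpy(rs_new / rs_old, p, r)  # p <- r + beta * p
        rs_old = rs_new
    return d


def toy_grad(w: Vector) -> Vector:
    """Gradient of the toy objective 0.5 * w^T A w - b^T w,
    with A = [[4, 1], [1, 3]] and b = [1, 2] (hypothetical values)."""
    return [4.0 * w[0] + 1.0 * w[1] - 1.0,
            1.0 * w[0] + 3.0 * w[1] - 2.0]


w0 = [0.0, 0.0]
step = newton_step_cg(toy_grad, w0)
w1 = axpy(1.0, step, w0)  # one full Newton update
```

Because the toy objective is quadratic, a single Newton step from the origin lands, up to floating-point error, on its minimizer (1/11, 7/11). For a nonlinear reconstruction objective such as the one the abstract describes, the same routine would be called repeatedly, typically with a line search or damping.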

Paper Structure

This paper contains 26 sections, 37 equations, 16 figures, 8 tables, 3 algorithms.

Figures (16)

  • Figure 1: Effect of varying $K$ in the loss function in Eqn (\ref{multi}) on out-of-sample accuracy when pruning ResNet20 on CIFAR-10 and ResNet50 on CIFAR-100 to 1:4 sparsity (74% and 66% respectively). Increasing $K$ improves the accuracy of the pruned network, at the cost of higher computation time.
  • Figure 1: Top-1 test accuracy when integrating SNOWS with other popular mask selection algorithms for $N$:$M$ pruning.
  • Figure 2: Computational graph to minimize Eqn (\ref{multi}).
  • Figure 3: Comparing SNOWS to one-shot pruning methods for unstructured sparsity.
  • Figure 4: Visualizing (a) the original test image and attention maps from the last layer of (b) the dense ViT/B-16 model, (c) the model obtained by applying a 2:4 MP mask, and (d) the model after applying SNOWS on top of MP. SNOWS optimizes to reconstruct learned activations, preserving features learned by the dense network even in the deepest layers.
  • ...and 11 more figures
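
The captions above refer to $N$:$M$ masks chosen by magnitude pruning (MP), such as the 2:4 mask in Figure 4. As a minimal sketch of how such a mask might be built, the function below keeps the N largest-magnitude weights in every group of M consecutive weights and zeroes the rest. This grouping convention is an assumption based on the common meaning of N:M sparsity, not something stated in the captions, and `nm_magnitude_mask` and the sample weights are hypothetical.

```python
from typing import List


def nm_magnitude_mask(weights: List[float], n: int, m: int) -> List[int]:
    """Return a 0/1 mask keeping the n largest-magnitude entries
    in each consecutive group of m weights (assumed N:M convention)."""
    if not 0 < n <= m:
        raise ValueError("need 0 < n <= m")
    if len(weights) % m != 0:
        raise ValueError("number of weights must be a multiple of m")
    mask: List[int] = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices within the group, ordered by decreasing magnitude.
        order = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(order[:n])
        mask.extend(1 if i in keep else 0 for i in range(m))
    return mask


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
mask = nm_magnitude_mask(weights, n=2, m=4)
pruned = [w * k for w, k in zip(weights, mask)]
```

A method like SNOWS would then take such a mask as fixed and readjust only the surviving weights; that readjustment step is not shown here.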