Subspace Node Pruning

Joshua Offergeld, Marcel van Gerven, Nasir Ahmad

TL;DR

This work tackles the challenge of reducing neural network inference cost without sacrificing accuracy. It introduces Subspace Node Pruning (SNP), which orthogonalizes layer activations into a lower-triangular subspace and uses linear least squares to reconstruct the impact of pruned units, while determining pruning ratios from cumulative variance. A novel ordering via unnormalized-ZCA captures unit redundancy, and LDL-based subspace transforms enable automatic, globally coordinated pruning across layers. The method achieves state-of-the-art or competitive results on ImageNet models (VGG-16, ResNet-50, DeiT) and demonstrates effective one-shot pruning on OPT, all with substantially reduced compute and without heavy per-layer tuning. Overall, SNP provides a simple, interpretable, and scalable framework for efficient pruning across CNNs and transformers, with broad applicability and potential for further refinements during training or dynamic pruning.

Abstract

Improving the efficiency of neural network inference is undeniably important in a time where commercial use of AI models increases daily. Node pruning is the art of removing computational units such as neurons, filters, attention heads, or even entire layers to significantly reduce inference time while retaining network performance. In this work, we propose the projection of unit activations to an orthogonal subspace in which there is no redundant activity and within which we may prune nodes while simultaneously recovering the impact of lost units via linear least squares. We furthermore show that the order in which units are orthogonalized can be optimized to maximally rank units by their redundancy. Finally, we leverage these orthogonal subspaces to automatically determine layer-wise pruning ratios based upon the relative scale of node activations in our subspace, equivalent to cumulative variance. Our method matches or exceeds state-of-the-art pruning results on ImageNet-trained VGG-16, ResNet-50 and DeiT models while simultaneously having up to 24x lower computational cost than alternative methods. We also demonstrate that this method can be applied in a one-shot manner to OPT LLM models, again outperforming competing methods.
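
The least-squares recovery described in the abstract can be illustrated with a minimal NumPy sketch. This is a generic compensation step, not the authors' exact procedure (which operates in a lower-triangular subspace). The function name `prune_with_reconstruction`, the activation matrix `X`, and the next-layer weights `W` are hypothetical names chosen for illustration.

```python
import numpy as np

def prune_with_reconstruction(X, W, keep):
    """Fold pruned units into the kept units via linear least squares.

    X:    (n_samples, n_units) activations of the layer being pruned (assumed)
    W:    (n_units, n_out) weights of the following layer (assumed)
    keep: indices of the units that survive pruning
    """
    keep = np.asarray(keep)
    prune = np.setdiff1d(np.arange(X.shape[1]), keep)
    G = X.T @ X  # Gram matrix of unit activations
    # A minimizes ||X[:, prune] - X[:, keep] @ A|| (normal equations)
    A = np.linalg.solve(G[np.ix_(keep, keep)], G[np.ix_(keep, prune)])
    # Kept units absorb the contribution of the pruned units
    return W[keep] + A @ W[prune]

rng = np.random.default_rng(0)
Z = rng.normal(size=(512, 4))
# Unit 4 is an exact linear combination of units 0 and 1, so it is redundant
X = np.column_stack([Z, 2.0 * Z[:, 0] - Z[:, 1]])
W = rng.normal(size=(5, 3))

W_new = prune_with_reconstruction(X, W, keep=[0, 1, 2, 3])
err_recon = np.linalg.norm(X @ W - X[:, :4] @ W_new)  # with compensation
err_naive = np.linalg.norm(X @ W - X[:, :4] @ W[:4])  # rows simply dropped
print(err_recon, err_naive)
```

In this toy setting the pruned unit is fully redundant, so the compensated weights reproduce the original layer output up to floating-point error, whereas simply dropping the row of `W` does not.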

Paper Structure (26 sections, 15 equations, 10 figures, 5 tables, 2 algorithms)

This paper contains 26 sections, 15 equations, 10 figures, 5 tables, 2 algorithms.

Figures (10)

  • Figure 1: Graphical depiction of our three main methods and contributions. From left to right: the construction of a subspace in which nodes can be pruned with automated reconstruction, a theoretically sound importance scoring method which aligns with our subspace construction, and finally an automated method based upon cumulative variance for selecting pruning ratios for all layers of a network.
  • Figure 2: Our choice of a subspace which is constructed for lower-triangular matrices is here justified. Left: If a dense matrix is used to form a subspace, pruning does not prune the original input nodes. Right: When pruning a lower-triangular transformation matrix, pruning the bottom row corresponds to pruning away an entire input node.
  • Figure 3: Latent unit variances in our subspace after Gram-Schmidt orthogonalization of layer 12 from VGG-16. Prior to the orthogonalization, the units are ordered randomly (left), by the SAW importance measure (middle), or by our proposed unnormalized-ZCA variances (right).
  • Figure 4: Left: performance prior to retraining; right: performance after retraining. A comparison of our subspace pruning (SNP) with our local ZCA-based importance (ZCA) and global variance cutoff (var) vs baseline methods on VGG-16. See Appendix \ref{app:hyp-overview} for details on the baseline methods. The black strided horizontal line (right panel) shows the initial network performance before pruning. The SNP-ZCA datapoints carry error bars from three randomly seeded training runs, though these are barely distinguishable. PFA-EN is the only method that uses PCA to determine global importance, indicated by the dashed line.
  • Figure 5: A comparison of pruning different groups of layers of ResNet-50. We compare our SNP-ZCA method with the variance heuristic against Intra-Fusion (IF). Further, we demonstrate our method when only 1024 samples (SNP 1024) or white noise (SNP wn) are input to the network while constructing the Gram matrices.
  • ...and 5 more figures
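
The latent variances of Figure 3 and the cumulative-variance cutoff of Figure 1 can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's implementation: the unit ordering is taken as given (the unnormalized-ZCA scoring is not reproduced here), the Gram matrix is uncentered, and the names `latent_variances`, `n_units_to_keep`, and `threshold` are hypothetical.

```python
import numpy as np

def latent_variances(X, order):
    """Per-unit variance left after Gram-Schmidt against all earlier units.

    X: (n_samples, n_units) activations (assumed); order: unit ordering.
    Uses the uncentered second-moment (Gram) matrix for simplicity.
    """
    Xo = X[:, order]
    C = Xo.T @ Xo / X.shape[0]
    L = np.linalg.cholesky(C)  # C = L L^T with L lower-triangular
    # Squared diagonal of L equals D in the LDL^T factorization of C
    return np.diag(L) ** 2

def n_units_to_keep(variances, threshold=0.99):
    """Smallest prefix of units whose cumulative variance reaches threshold."""
    frac = np.cumsum(variances) / np.sum(variances)
    k = int(np.searchsorted(frac, threshold)) + 1
    return min(k, len(variances))

rng = np.random.default_rng(0)
Z = rng.normal(size=(2000, 3)) * np.array([2.0, 1.0, 0.5])
# Unit 3 is almost a copy of unit 0, so it adds little new variance
X = np.column_stack([Z, Z[:, 0] + 0.01 * rng.normal(size=2000)])

D = latent_variances(X, order=[0, 1, 2, 3])
k = n_units_to_keep(D, threshold=0.99)
print(D, k)
```

Because the factor is lower-triangular, each latent unit depends only on the original units that precede it in the ordering, so dropping trailing latent units corresponds to dropping whole original units, which is the property Figure 2 highlights.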