Scalable iterative pruning of large language and vision models using block coordinate descent

Gili Rosenberg, J. Kyle Brubaker, Martin J. A. Schuetz, Elton Yechao Zhu, Serdar Kadıoğlu, Sima E. Borujeni, Helmut G. Katzgraber

TL;DR

The iterative, block-based nature of this pruning technique, dubbed "iterative Combinatorial Brain Surgeon" (iCBS), lets it scale to very large models, including large language models (LLMs), for which a one-shot combinatorial optimization approach may not be feasible.

Abstract

Pruning neural networks, which involves removing a fraction of their weights, can often maintain high accuracy while significantly reducing model complexity, at least up to a certain limit. We present a neural network pruning technique that builds upon the Combinatorial Brain Surgeon, but solves an optimization problem over a subset of the network weights in an iterative, block-wise manner using block coordinate descent. The iterative, block-based nature of this pruning technique, which we dub "iterative Combinatorial Brain Surgeon" (iCBS), allows it to scale to very large models, including large language models (LLMs), for which a one-shot combinatorial optimization approach may not be feasible. When applied to large models like Mistral and DeiT, iCBS achieves higher performance metrics at the same density levels than existing pruning methods such as Wanda. This demonstrates the effectiveness of this iterative, block-wise pruning method in compressing and optimizing the performance of large deep learning models, even while optimizing over only a small fraction of the weights. Moreover, our approach allows for a quality-time (or cost) tradeoff that is not available when using a one-shot pruning technique alone. The block-wise formulation of the optimization problem enables the use of hardware accelerators, potentially offsetting the increased computational costs compared to one-shot pruning methods like Wanda. In particular, the optimization problem solved for each block is quantum-amenable in that it could, in principle, be solved by a quantum computer.

Paper Structure

This paper contains 14 sections, 9 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: Pruning neural network (NN) weights. An example fully connected (feedforward) NN with three input nodes, one hidden layer with four nodes, two output nodes, and weights denoted by $\bm{w^0}$ (left) is pruned, removing half of the weights connecting the input layer with the hidden layer, resulting in a much sparser NN (right) with weights denoted by $\bm{w'}$. The density $d=0.7$ of the pruned model is the ratio of the number of weights in the pruned model (right) to the number of weights in the original model (left), and reflects a reduction of 30% in the number of weights.
  • Figure 2: Schematic illustration of the per-block pruning process in iCBS. The init_method weight-scoring method is used to score all the weights at the beginning. Weights with extreme scores are fixed to always be pruned/kept, meaning that these weights are taken out of the candidate pool for the per-block optimization. Then, in each step, the selection_method weight-scoring method is used to score the non-fixed (free) weights. Next, the weights to be optimized over are selected from the currently pruned/kept sets, taking into account which weights are currently on the tabu list. The gradient and Hessian are then estimated for that block, and an optimization problem is constructed and passed to the block optimizer. The block optimizer solves the per-block optimization problem and returns a solution. Finally, the currently kept/pruned sets and the tabu list are updated (not pictured).
  • Figure 3: Results for the Garment Classifier model on the Fashion-MNIST dataset. This plot shows the dependence of the final (post-pruning) validation accuracy (top-1) on the density for various types of pruning: the baselines and our pruner iCBS. The horizontal line labeled "No pruning" shows the validation accuracy of the unpruned model. Error bars are included for all baselines except magnitude (since it is deterministic) and show the standard deviation over 30 random repetitions.
  • Figure 4: Results for the DeiT model on the ImageNet-1K dataset. These plots show the dependence of the post-pruning validation accuracy (top-1) on the density for various types of pruning: the baselines and our pruner iCBS. The horizontal line labeled "No pruning" shows the validation accuracy of the unpruned model.
  • Figure 5: Results for the Mistral-7b model pruned using the C4 dataset and validated using the LM Evaluation Harness. These plots show the dependence of the post-pruning validation accuracy (top-1) on the density for various types of pruning: the baselines and our pruner iCBS. The horizontal line labeled "No pruning" shows the validation accuracy of the unpruned model.
  • ...and 1 more figure
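
The density in Figure 1 and the per-block loop in Figure 2 can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a fixed quadratic loss model (a given gradient g and Hessian H, rather than per-block estimates), uses weight magnitude as a stand-in for both init_method and selection_method, and replaces the block optimizer with brute-force enumeration. All function names and parameters (density, delta_loss, solve_block, prune_blockwise, fix_frac, tabu_len) are hypothetical.

```python
import itertools
import numpy as np

def density(mask):
    """Fraction of weights kept (True = kept)."""
    return float(np.mean(mask))

def delta_loss(mask, w, g, H):
    """Second-order estimate of the loss change when pruned weights are zeroed."""
    dw = -w * (~mask)  # pruning weight i changes it by -w[i]
    return float(g @ dw + 0.5 * dw @ H @ dw)

def solve_block(mask, block, n_keep, w, g, H):
    """Stand-in block optimizer: enumerate every way to keep n_keep block weights."""
    best_mask, best_val = mask, delta_loss(mask, w, g, H)
    for kept in itertools.combinations(block, n_keep):
        trial = mask.copy()
        trial[block] = False
        trial[np.array(kept, dtype=int)] = True
        val = delta_loss(trial, w, g, H)
        if val < best_val:
            best_mask, best_val = trial, val
    return best_mask

def prune_blockwise(w, g, H, target_density, block_size=6,
                    n_steps=10, fix_frac=0.2, tabu_len=2):
    n = w.size
    order = np.argsort(np.abs(w))  # init_method stand-in: magnitude
    n_keep_total = int(round(target_density * n))
    mask = np.zeros(n, dtype=bool)
    mask[order[n - n_keep_total:]] = True  # initial solution: keep top scores
    # Fix the extreme-scoring weights as always pruned / always kept.
    # Assumes n_fix does not exceed the number of pruned or kept weights.
    n_fix = int(fix_frac * n) // 2
    fixed = np.zeros(n, dtype=bool)
    fixed[order[:n_fix]] = True
    fixed[order[n - n_fix:]] = True
    tabu = np.zeros(n, dtype=int)  # remaining tabu steps per weight
    half = block_size // 2
    for _ in range(n_steps):
        free = ~fixed & (tabu == 0)
        sel = np.abs(w)  # selection_method stand-in: magnitude
        kept_free = np.flatnonzero(free & mask)
        pruned_free = np.flatnonzero(free & ~mask)
        # Candidates: lowest-scoring kept and highest-scoring pruned weights.
        kept_pick = kept_free[np.argsort(sel[kept_free])[:half]]
        pruned_pick = pruned_free[np.argsort(-sel[pruned_free])[:half]]
        block = np.concatenate([kept_pick, pruned_pick])
        if block.size == 0:
            break
        # Keeping the same number of block weights preserves the density.
        mask = solve_block(mask, block, len(kept_pick), w, g, H)
        tabu = np.maximum(tabu - 1, 0)
        tabu[block] = tabu_len
    return mask
```

For the network in Figure 1, there are 3×4 + 4×2 = 20 weights, and removing 6 of the 12 input-to-hidden weights leaves 14, so density returns 14/20 = 0.7 for the corresponding mask. Because solve_block only accepts a candidate that lowers the estimated loss change and keeps the number of kept weights in the block constant, each step in this sketch leaves the density unchanged and never increases the quadratic loss estimate.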

Theorems & Definitions (1)

  • Definition 1: NN Pruning Problem