MultiPruner: Balanced Structure Removal in Foundation Models

J. Pablo Muñoz, Jinjie Yuan, Nilesh Jain

TL;DR

MultiPruner introduces a training-free, multidimensional pruning framework for foundation models that balances depth and width by sequentially pruning residual blocks, MLP channels, and attention heads. By using fixed targets and, where appropriate, evolutionary search (NSGA-II), it achieves competitive or superior zero-shot downstream performance while yielding smaller, faster models. The approach demonstrates strong results across multiple LLMs and pruning ratios, with ablations and sensitivity analyses underscoring the importance of the pruning order and weight reordering. Recovery tuning further enhances performance, making pruned models viable for deployment on resource-constrained hardware with notable inference speedups.

Abstract

Recently, state-of-the-art approaches for pruning large pre-trained models (LPMs) have demonstrated that the training-free removal of non-critical residual blocks in Transformers is viable for reducing model size, achieving results that outperform previous training-free pruning approaches. Motivated by these findings, we extend BlockPruner (Zhong et al., 2024) and propose MultiPruner, a pruning approach that surpasses recent training-free pruning methods by adopting a multidimensional, iterative, fine-grained pruning strategy. In MultiPruner, multidimensional pruning reinstates the structural balance in block-pruned models by sequentially compressing along three dimensions: i) residual blocks, ii) channels of multilayer perceptrons (MLP), and iii) attention heads. This solution enhances zero-shot accuracy on downstream tasks compared to other techniques while improving model compression ratios, producing compressed models with fewer computing and memory requirements. Extensive experiments demonstrate the advantages of the proposed method across various large pre-trained models. The code and pruning configurations are available at https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning.
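
The sequential order described above (residual blocks first, then MLP channels, then attention heads) can be illustrated with a minimal sketch. It is not the released implementation: the function `sequential_prune`, the `Structure` tuples, the per-stage parameter budgets, and the toy scores are all hypothetical. The `score` callable stands in for whatever quality measure guides removal, such as perplexity on calibration data after a candidate is removed; here it is a fixed lookup table so the example runs on its own.

```python
from typing import Callable, Dict, List, Tuple

# Illustrative identifier for a prunable structure: (kind, layer index, unit index).
Structure = Tuple[str, int, int]


def sequential_prune(
    candidates: Dict[str, List[Structure]],
    sizes: Dict[Structure, int],
    stage_budgets: Dict[str, int],
    score: Callable[[Structure, List[Structure]], float],
    order: Tuple[str, ...] = ("block", "mlp_channel", "attn_head"),
) -> List[Structure]:
    """Greedily remove structures one dimension at a time.

    stage_budgets[kind] is the number of parameters the stage may remove
    (assumed here to be the stage pruning ratio times the model size).
    score(candidate, removed_so_far) is lower when removal hurts less.
    """
    removed: List[Structure] = []
    for kind in order:
        pool = list(candidates[kind])
        stage_removed = 0
        while pool:
            # Pick the candidate whose removal is scored as least harmful.
            best = min(pool, key=lambda s: score(s, removed))
            # Stop the stage once the next removal would exceed its budget.
            if stage_removed + sizes[best] > stage_budgets[kind]:
                break
            pool.remove(best)
            removed.append(best)
            stage_removed += sizes[best]
    return removed


# Toy setup with made-up sizes and scores (not taken from the paper).
candidates = {
    "block": [("block", 0, 0), ("block", 1, 0)],
    "mlp_channel": [("mlp_channel", 0, i) for i in range(4)],
    "attn_head": [("attn_head", 0, i) for i in range(2)],
}
sizes = {s: {"block": 100, "mlp_channel": 30, "attn_head": 10}[s[0]]
         for group in candidates.values() for s in group}
toy_scores = {
    ("block", 0, 0): 9.0, ("block", 1, 0): 7.5,
    ("mlp_channel", 0, 0): 5.0, ("mlp_channel", 0, 1): 6.0,
    ("mlp_channel", 0, 2): 7.0, ("mlp_channel", 0, 3): 8.0,
    ("attn_head", 0, 0): 6.0, ("attn_head", 0, 1): 5.5,
}
stage_budgets = {"block": 100, "mlp_channel": 100, "attn_head": 10}

removed = sequential_prune(
    candidates, sizes, stage_budgets,
    score=lambda s, removed_so_far: toy_scores[s],
)
print(removed)
```

In a real setting the score would have to be recomputed after every removal, because the least harmful candidate depends on what has already been pruned; that is why `score` receives the list of structures removed so far.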

Paper Structure (30 sections, 6 figures, 10 tables, 2 algorithms)

Figures (6)

  • Figure 1: MultiPruner adopts a multidimensional fine-grained pruning method to make pruning more balanced, resulting in a higher-performance pruned model.
  • Figure 2: Search progression for evolutionary search (NSGA-II) on the width dimension (Llama2-7B), starting from a block-pruned model with a pruning ratio of 11%. The red line marks the target pruning ratio of 22%. We select the subnetwork with the lowest PPL value on this line as the final pruned model. Subnetworks with PPL $>$ 100 are omitted from the figure.
  • Figure 3: Comparison of BlockPruner, MultiPruner and MultiPruner-Evol at different pruning ratios for Llama2-7B. Average score means the average accuracy across five downstream tasks.
  • Figure 4: The results of increasing or decreasing the share of the target pruning ratio allocated to pruning MLP channels or attention heads (Llama2-7B with a target ratio of 22%). For example, an MLP ratio weight of 50% means that the pruning ratio for the MLP channel pruning stage is 22% $\times$ 50% = 11%. The yellow point represents the ratio weights adopted in most experiments, namely Block : MLP Channel : Attention Head = 44% : 52% : 4%. The average score is the average accuracy across five downstream tasks.
  • Figure 5: Details of the pruned Llama2-7B model obtained by MultiPruner, including the width of the self-attention and MLP modules across different layers. The numbers within the colored boxes represent the channel sizes, while the white boxes indicate blocks that have been completely removed.
  • ...and 1 more figure
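
The arithmetic in the Figure 4 caption and the selection rule in the Figure 2 caption can be made concrete with a short sketch. The helper names `stage_ratios` and `select_at_target`, the tolerance `tol`, and the list of candidate subnetworks are assumptions introduced for illustration; only the 22% target and the 44% : 52% : 4% weights come from the captions.

```python
from typing import Dict, List, Tuple


def stage_ratios(target_ratio: float, weights: Dict[str, float]) -> Dict[str, float]:
    """Split an overall pruning ratio across stages according to ratio weights."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("ratio weights must sum to 1")
    return {stage: target_ratio * w for stage, w in weights.items()}


def select_at_target(
    subnets: List[Tuple[float, float]], target_ratio: float, tol: float = 1e-3
) -> Tuple[float, float]:
    """Return the (pruning_ratio, ppl) pair with the lowest PPL on the target line.

    tol is an assumed tolerance for deciding that a subnetwork sits at the target ratio.
    """
    on_line = [s for s in subnets if abs(s[0] - target_ratio) <= tol]
    if not on_line:
        raise ValueError("no subnetwork at the target pruning ratio")
    return min(on_line, key=lambda s: s[1])


# Weights reported in the Figure 4 caption, applied to the 22% target.
ratios = stage_ratios(0.22, {"block": 0.44, "mlp_channel": 0.52, "attn_head": 0.04})
for stage, r in ratios.items():
    print(f"{stage}: {r:.2%}")

# Made-up search results as (pruning_ratio, ppl) pairs, for illustration only.
toy_subnets = [(0.20, 8.4), (0.22, 9.6), (0.22, 9.1), (0.22, 12.3), (0.24, 10.2)]
best = select_at_target(toy_subnets, target_ratio=0.22)
print(best)
```

With these weights the block, MLP channel, and attention head stages account for roughly 9.68%, 11.44%, and 0.88% of the parameters respectively, which together make up the 22% target.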