Zeroth-Order Adaptive Neuron Alignment Based Pruning without Re-Training

Elia Cunegatti, Leonardo Lucio Custode, Giovanni Iacca

TL;DR

This paper proposes NeuronAl, a top-up algorithm that can be used on top of any given pruning algorithm for LLMs. It modifies the block-wise and row-wise sparsity, exploiting information from both the dense model and its sparse version to maximize the neuron alignment among activations.

Abstract

Network pruning aims to reduce a given model's computational cost by removing a subset of its parameters with minimal impact on performance. Over the last decade, the most widely used pruning paradigm has been pruning followed by re-training, which is now impractical: pre-trained models are numerous and, in any case, too expensive to re-train. In this paper, we exploit functional information from dense pre-trained models, i.e., their input activations, to obtain sparse models whose activations are maximally aligned with those of the corresponding dense models. To this end, we propose NeuronAl, a top-up algorithm that can be used on top of any given pruning algorithm for LLMs. It modifies the block-wise and row-wise sparsity, exploiting information from both the dense model and its sparse version to maximize the neuron alignment among activations. Unlike existing methods, our approach adaptively selects the best hyperparameters for the block-wise and row-wise sparsity ratios w.r.t. the model and the desired sparsity, and requires no re-training. We test our method over roughly 300 test cases with four LLM families, three sparsity ratios, and ten language tasks (three language modeling and seven zero-shot datasets), showing that it consistently outperforms the latest state-of-the-art methods in terms of performance-runtime trade-off. The code is available at https://github.com/eliacunegatti/NeuroAL.
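
The block-wise selection idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes a toy ReLU MLP in place of an LLM, magnitude pruning as the base pruning algorithm, a hypothetical linear sparsity schedule controlled by a parameter `lam`, and a simple normalized L1 activation gap as a stand-in for the paper's neuron-alignment score. All function names are illustrative, and the row-wise stage is omitted.

```python
import numpy as np

def magnitude_prune(W, sparsity):
    """Zero out the `sparsity` fraction of entries with the smallest magnitude."""
    k = int(round(sparsity * W.size))
    if k == 0:
        return W.copy()
    # k-th smallest absolute value serves as the pruning threshold
    thresh = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
    return np.where(np.abs(W) <= thresh, 0.0, W)

def layer_sparsities(n_layers, target, lam):
    """Assumed schedule: ratios go linearly from target - lam to target + lam,
    so their mean stays equal to the target sparsity."""
    return np.linspace(target - lam, target + lam, n_layers)

def activations(weights, X):
    """Forward pass through a toy ReLU MLP, returning each layer's output."""
    acts, h = [], X
    for W in weights:
        h = np.maximum(h @ W, 0.0)
        acts.append(h)
    return acts

def alignment_gap(dense_acts, sparse_acts):
    """Assumed metric: L1 distance between sum-normalized activations,
    accumulated over layers (lower means better aligned)."""
    gap = 0.0
    for a_d, a_s in zip(dense_acts, sparse_acts):
        a_d = a_d / (np.abs(a_d).sum() + 1e-12)
        a_s = a_s / (np.abs(a_s).sum() + 1e-12)
        gap += np.abs(a_d - a_s).sum()
    return gap

def select_lambda(weights, X, target, candidates):
    """Try each candidate schedule without any re-training and keep the one
    whose sparse activations stay closest to the dense activations."""
    dense_acts = activations(weights, X)
    best = None
    for lam in candidates:
        ratios = layer_sparsities(len(weights), target, lam)
        sparse = [magnitude_prune(W, s) for W, s in zip(weights, ratios)]
        gap = alignment_gap(dense_acts, activations(sparse, X))
        if best is None or gap < best[1]:
            best = (lam, gap)
    return best

# Toy setup: 4 layers of 16x16 weights and 32 calibration inputs.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((16, 16)) for _ in range(4)]
X = rng.standard_normal((32, 16))
candidates = [0.0, 0.05, 0.1, 0.2]
best_lam, best_gap = select_lambda(weights, X, target=0.7, candidates=candidates)
print("selected lam:", best_lam, "gap:", best_gap)
```

The selection loop only needs forward passes on calibration data, which is what makes this kind of search compatible with a no-re-training setting.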

Paper Structure

This paper contains 43 sections, 13 equations, 7 figures, 36 tables, 2 algorithms.

Figures (7)

  • Figure 1: Perplexity vs. Runtime (seconds) trade-off among different top-up algorithms and our proposed NeuronAl based on LLama-1 7B with a sparsity of 70%, evaluated on WikiText2.
  • Figure 2: Perplexity for various hyperparameter settings of OWL ($M$,$\lambda$) and AlphaPruning ($\epsilon$) using Phi-2 and LLama-1 7B for three sparsity ratios. The gray square corresponds to the hyperparameter values that lead to the best performance.
  • Figure 3: Left: Overall NeuronAl top-up pruning procedure. Right: GetBestNeuronAL sub-routine used in both block- and row-selection stages.
  • Figure 4: Perplexity over different values of $\lambda$ at 70% sparsity. The orange dot indicates the value selected by NeuronAl using neuron alignment. The green dot indicates the value selected by NeuronAl using the reconstruction error rather than the neuron alignment (see the paper's section comparing neuron alignment with reconstruction error).
  • Figure 5: Perplexity over different values of $|C_{\lambda}|$ (size of the calibration data) when using NeuronAl on the three Language Modeling datasets at 70% sparsity.
  • ...and 2 more figures