Change Is the Only Constant: Dynamic LLM Slicing based on Layer Redundancy

Razvan-Gabriel Dumitru, Paul-Ioan Clotan, Vikas Yadav, Darius Peteleaza, Mihai Surdeanu

TL;DR

A novel model compression approach through dynamic layer-specific pruning in Large Language Models (LLMs), enhancing the traditional methodology established by SliceGPT by leveraging the newly proposed Layer Redundancy (LR) score.

Abstract

This paper introduces a novel model compression approach through dynamic layer-specific pruning in Large Language Models (LLMs), enhancing the traditional methodology established by SliceGPT. By transitioning from constant to dynamic slicing, our method leverages the newly proposed Layer Redundancy (LR) score, which assesses how much each layer changes its input by measuring the cosine similarity of the input to the output of the layer. We use this score to prune parts of individual layers based on redundancy in such a way that the average pruned percentage for all layers is a fixed value. We conducted extensive experiments using models like Llama3-8B and Mistral-7B on multiple datasets, evaluating different slicing bases and percentages to determine optimal configurations that balance efficiency and performance. Our findings show that our dynamic slicing approach not only maintains but, in many cases, enhances model performance compared to the baseline established by constant slicing methods. For instance, in several settings, we see performance improvements of up to 5% over the SliceGPT baseline. Additionally, a perplexity decrease of as much as 7% was observed across multiple benchmarks, validating the effectiveness of our method. The code, model weights, and datasets are open-sourced at https://github.com/RazvanDu/DynamicSlicing.
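To make the idea in the abstract concrete, the following is a minimal sketch of an LR-style score and of a per-layer slice allocation whose mean equals a fixed target. The cosine-similarity score follows the description above. The allocation rule is an assumption made for illustration only: this excerpt does not reproduce the paper's actual transformations (those shown in Figure 1), so the proportional rule and the names `layer_redundancy` and `allocate_slices` are hypothetical. The parameters `s_p` and `s_b` mirror the average slice percentage $S_P$ and base slice $S_B$ from the Figure 1 caption.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def layer_redundancy(layer_input, layer_output):
    """LR-style score: similarity of a layer's input to its output.

    A value near 1 means the layer barely changes its input.
    """
    return cosine_similarity(layer_input, layer_output)


def allocate_slices(lr_scores, s_p, s_b):
    """Hypothetical allocation rule (an assumption, not the paper's formula).

    Every layer receives the base slice s_b. The remaining budget is split
    in proportion to each layer's score, so more redundant layers are sliced
    more and the mean slice over all layers equals s_p.
    Assumes all scores are non-negative and not all zero.
    """
    n = len(lr_scores)
    extra_budget = (s_p - s_b) * n
    total = sum(lr_scores)
    return [s_b + extra_budget * score / total for score in lr_scores]


# Toy example with four layers and made-up scores.
scores = [0.9, 0.5, 0.7, 0.3]
slices = allocate_slices(scores, s_p=0.3, s_b=0.1)
mean_slice = sum(slices) / len(slices)
```

In this sketch the constraint from the abstract holds by construction: the per-layer percentages differ, but their average stays at the target. A practical version would also need to cap each layer's slice below 100%, which this toy rule does not enforce.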

Paper Structure

This paper contains 14 sections, 4 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Example of the Layer Redundancy (LR) score and of the transformations used to achieve a slice percentage of 30% ($S_P=0.3$) with a base slice of 10% for all layers ($S_B=0.1$). The example is shown for the 32 layers of Llama3-8B.
  • Figure 2: Llama3-8B with 40% of the network sliced on average. The red line is the baseline accuracy achieved by SliceGPT with a constant 40% slice.
  • Figure 3: Llama3-8B with 30% of the network sliced on average. The red line is the baseline accuracy achieved by SliceGPT with a constant 30% slice.
  • Figure 4: Llama3-8B with 35% of the network sliced on average. The red line is the baseline accuracy achieved by SliceGPT with a constant 35% slice.
  • Figure 5: Mistral-7B with 30% of the network sliced on average. The red line is the baseline accuracy achieved by SliceGPT with a constant 30% slice.
  • ...and 2 more figures