Determining Layer-wise Sparsity for Large Language Models Through a Theoretical Perspective

Weizhong Huang, Yuxin Zhang, Xiawu Zheng, Fei Chao, Rongrong Ji

TL;DR

This work addresses the challenge of determining layer-wise sparsity in large language models. It identifies a "reconstruction error explosion", in which reconstruction errors from earlier layers propagate and amplify in later layers, and introduces ATP, a theoretically motivated scheme in which the per-layer sparsity $s_i$ follows a monotonically increasing arithmetic progression with average sparsity $S$ and common difference $\beta$. Because $\beta$ is the only free hyperparameter, it can be found with a small grid search. Theoretical analysis and experiments across diverse LLMs show improvements in perplexity, zero-shot accuracy, and inference speed, with allocations close to those found by Bayesian search. The approach also applies to vision and multimodal models and can be combined with other compression techniques.
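
The progression can be written out explicitly. The following is a sketch consistent with the description above, not a quotation of the paper's formula: the 1-based layer index and the symbol $L$ for the number of layers are assumptions, and the paper's own notation may differ.

$$
s_i = S + \beta\left(i - \frac{L+1}{2}\right), \quad i = 1, \dots, L, \qquad \frac{1}{L}\sum_{i=1}^{L} s_i = S .
$$

Keeping every rate inside $[0, 1]$ requires $s_1 \ge 0$ and $s_L \le 1$, which gives $0 < \beta \le \frac{2\min(S,\, 1 - S)}{L - 1}$. As a worked example, assuming one rate per transformer block and $L = 32$ blocks (as in LLaMA2-7B) with $S = 0.7$, the bound is $\beta \le 0.6 / 31 \approx 0.0194$, so the search over $\beta$ covers only a narrow interval.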

Abstract

In this paper, we address the challenge of determining the layer-wise sparsity rates of large language models (LLMs) through a theoretical perspective. Specifically, we identify a critical issue of "reconstruction error explosion" in existing LLM sparsification methods. This refers to the cumulative effect of reconstruction errors throughout the sparsification process, where errors from earlier layers propagate and amplify in subsequent layers. As a result, the overall reconstruction error increases significantly, leading to a substantial degradation in model performance. Through theoretical analysis, we derive a simple yet effective approach to layer-wise sparsity allocation that mitigates this issue. Our method uses a monotonically increasing arithmetic progression, reducing the process of determining sparsity rates for multiple layers to the determination of a single common difference hyperparameter. Remarkably, this allows for the optimal layer-wise sparsity rates to be identified with just a few trials. Both our theoretical analysis and experimental results demonstrate that this sparsity allocation scheme is near optimal. Extensive experiments show that our method significantly improves the performance of sparse LLMs across various architectures, outperforming existing layer-wise sparsity methods. Furthermore, it enhances the performance of various compression techniques and is applicable to vision and multimodal models. Notably, our method achieves a reduction of 52.10 in perplexity for the 70% sparse LLaMA2-7B model obtained via Wanda, improves average zero-shot accuracy by 10.50%, and delivers speedups of 2.63$\times$ and 2.23$\times$ on CPU and GPU, respectively.
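
To illustrate how a single common difference can determine every layer's sparsity rate, here is a minimal Python sketch. It is not the authors' implementation. The function names (`arithmetic_sparsity`, `max_beta`, `grid_search_beta`), the 0-based layer index, and the evenly spaced grid are all assumptions made for illustration. `toy_score` is a hypothetical placeholder for a real evaluation such as the perplexity of the pruned model.

```python
from typing import Callable, List, Tuple


def max_beta(num_layers: int, avg_sparsity: float) -> float:
    """Largest common difference that keeps every rate inside [0, 1]."""
    return 2.0 * min(avg_sparsity, 1.0 - avg_sparsity) / (num_layers - 1)


def arithmetic_sparsity(num_layers: int, avg_sparsity: float, beta: float) -> List[float]:
    """Per-layer rates that rise by `beta` per layer and average to `avg_sparsity`."""
    center = (num_layers - 1) / 2.0
    rates = [avg_sparsity + beta * (i - center) for i in range(num_layers)]
    tol = 1e-9  # tolerate floating-point overshoot at the boundary
    if min(rates) < -tol or max(rates) > 1.0 + tol:
        raise ValueError("beta too large: some rate leaves [0, 1]")
    return [min(1.0, max(0.0, r)) for r in rates]


def grid_search_beta(
    num_layers: int,
    avg_sparsity: float,
    evaluate: Callable[[List[float]], float],
    num_trials: int = 10,
) -> Tuple[float, float]:
    """Try `num_trials` evenly spaced values of beta; return the best (beta, score).

    Lower scores are treated as better, as with perplexity.
    """
    upper = max_beta(num_layers, avg_sparsity)
    best_beta, best_score = 0.0, float("inf")
    for k in range(1, num_trials + 1):
        beta = upper * k / num_trials
        score = evaluate(arithmetic_sparsity(num_layers, avg_sparsity, beta))
        if score < best_score:
            best_beta, best_score = beta, score
    return best_beta, best_score


def toy_score(rates: List[float]) -> float:
    """Hypothetical stand-in for model evaluation: prefers a first-to-last spread of 0.3."""
    return abs((rates[-1] - rates[0]) - 0.3)


if __name__ == "__main__":
    beta, score = grid_search_beta(32, 0.7, toy_score, num_trials=10)
    print(f"best beta = {beta:.5f}, score = {score:.3g}")
```

In practice `evaluate` would prune the model with the given rates and measure it on held-out data, so each trial costs one pruning pass plus one evaluation. That cost is why reducing the search to a single scalar matters.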

Paper Structure

This paper contains 50 sections, 5 theorems, 33 equations, 4 figures, 22 tables, 1 algorithm.

Key Result

Theorem 3.1

Increasing the sparsity of the weights in the $i$-th layer will lead to an increase in the reconstruction error of this layer.
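
The statement can be checked numerically on a toy layer. The sketch below is an illustration, not the paper's experiment or proof. It assumes that the reconstruction error is the squared Frobenius norm of the difference between the dense and sparse layer outputs. It uses simple global magnitude pruning as a stand-in for the pruning methods discussed in the paper, with random Gaussian weights and inputs. All names are hypothetical.

```python
import random


def magnitude_prune(weights, sparsity):
    """Zero the `sparsity` fraction of entries with the smallest magnitude."""
    flat = sorted(abs(w) for row in weights for w in row)
    num_pruned = int(round(sparsity * len(flat)))
    if num_pruned == 0:
        return [row[:] for row in weights]
    threshold = flat[num_pruned - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


def reconstruction_error(dense, sparse, inputs):
    """Squared Frobenius norm of (dense - sparse) @ inputs.

    `dense` and `sparse` are d_out x d_in; `inputs` is d_in x n_samples.
    """
    total = 0.0
    num_samples = len(inputs[0])
    for d_row, s_row in zip(dense, sparse):
        diff = [d - s for d, s in zip(d_row, s_row)]
        for col in range(num_samples):
            out = sum(diff[k] * inputs[k][col] for k in range(len(diff)))
            total += out * out
    return total


rng = random.Random(0)  # fixed seed so the run is repeatable
d_out, d_in, n_samples = 8, 16, 256
W = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]
X = [[rng.gauss(0.0, 1.0) for _ in range(n_samples)] for _ in range(d_in)]

# Reconstruction error of the same layer at three sparsity levels.
errors = {s: reconstruction_error(W, magnitude_prune(W, s), X) for s in (0.3, 0.5, 0.7)}
for s, err in errors.items():
    print(f"sparsity {s:.1f}: reconstruction error {err:.2f}")
```

The error grows with the sparsity level here because each higher level removes every weight removed at the lower level plus additional, larger ones.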

Figures (4)

  • Figure 1: (Left) Comparison of reconstruction error among different layer-wise sparsity methods. All methods suffer from "reconstruction error explosion", but our method achieves lower reconstruction error than the others. (Right) Comparison between our method and other layer-wise sparsity methods. Metric-based methods compute the importance of each layer to obtain its sparsity rate; these metrics are heuristically designed by human experts and are not optimal. Search-based methods require a large number of search iterations, which is time-consuming. In contrast, we analyze the causes of "reconstruction error explosion" from a theoretical perspective and show that determining the layer-wise sparsity rates with a monotonically increasing arithmetic progression alleviates the problem.
  • Figure 2: Comparison of layer-wise sparsity rate distributions at different average sparsity levels.
  • Figure 3: Comparison of layer-wise sparsity rate distribution with other methods.
  • Figure 4: Impact of different $\beta$ settings on the perplexity of the 70% sparse LLaMA2-7B.

Theorems & Definitions (10)

  • Theorem 3.1: Effect of increased sparsity on reconstruction error
  • Proof of Theorem 3.1
  • Theorem 3.2: The cumulative effect of reconstruction error
  • Lemma 3.3
  • Proof of Theorem 3.2
  • Theorem 3.4: Impact of the sparsity of the previous layer on the reconstruction error of the next layer
  • Proof of Theorem 3.4
  • Theorem 3.5
  • Proof of Lemma 3.3
  • Proof of Theorem 3.5