Determining Layer-wise Sparsity for Large Language Models Through a Theoretical Perspective
Weizhong Huang, Yuxin Zhang, Xiawu Zheng, Fei Chao, Rongrong Ji
TL;DR
This work addresses the challenge of determining layer-wise sparsity in large language models. It identifies a reconstruction error explosion that arises when sparsity is allocated non-optimally across layers. It introduces ATP, a theoretically motivated scheme in which the per-layer sparsity $s_i$ follows a monotonically increasing arithmetic progression with average sparsity $S$ and common difference $\beta$, where $\beta$ is found efficiently via grid search. Supported by theoretical analysis and experiments across diverse LLMs, ATP improves perplexity, zero-shot accuracy, and inference speed, and its allocations are near-optimal when compared with Bayesian search. The approach also applies to vision and multimodal models and can be combined with various compression techniques.
Abstract
In this paper, we address the challenge of determining the layer-wise sparsity rates of large language models (LLMs) through a theoretical perspective. Specifically, we identify a critical issue of "reconstruction error explosion" in existing LLM sparsification methods. This refers to the cumulative effect of reconstruction errors throughout the sparsification process, where errors from earlier layers propagate and amplify in subsequent layers. As a result, the overall reconstruction error increases significantly, leading to a substantial degradation in model performance. Through theoretical analysis, we derive a simple yet effective approach to layer-wise sparsity allocation that mitigates this issue. Our method uses a monotonically increasing arithmetic progression, reducing the process of determining sparsity rates for multiple layers to the determination of a single common difference hyperparameter. Remarkably, this allows the optimal layer-wise sparsity rates to be identified with just a few trials. Both our theoretical analysis and experimental results demonstrate that this sparsity allocation scheme is near-optimal. Extensive experiments show that our method significantly improves the performance of sparse LLMs across various architectures, outperforming existing layer-wise sparsity methods. Furthermore, it enhances the performance of various compression techniques and is applicable to vision and multimodal models. Notably, for the 70% sparse LLaMA2-7B model obtained via Wanda, our method reduces perplexity by 52.10, improves average zero-shot accuracy by 10.50%, and delivers speedups of 2.63$\times$ and 2.23$\times$ on CPU and GPU, respectively.
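The allocation idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the progression is centered so that its mean equals the target average sparsity, and the names `arithmetic_sparsities`, `grid_search_beta`, and the `evaluate` callback (standing in for pruning a model and measuring, for example, perplexity) are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def arithmetic_sparsities(num_layers: int, avg_sparsity: float, beta: float) -> List[float]:
    """Return per-layer sparsity rates that increase by `beta` per layer
    and whose mean equals `avg_sparsity` (assumed centering convention)."""
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    if beta < 0:
        raise ValueError("beta must be non-negative for an increasing progression")
    # Shift the first term so that the average over all layers is avg_sparsity.
    first = avg_sparsity - beta * (num_layers - 1) / 2.0
    rates = [first + beta * i for i in range(num_layers)]
    if rates[0] < 0.0 or rates[-1] > 1.0:
        raise ValueError("beta is too large: some rates fall outside [0, 1]")
    return rates


def grid_search_beta(
    num_layers: int,
    avg_sparsity: float,
    candidates: Sequence[float],
    evaluate: Callable[[List[float]], float],
) -> Tuple[float, float]:
    """Try each candidate common difference and keep the one with the lowest
    score. `evaluate` is a hypothetical callback (lower is better)."""
    best_beta: Optional[float] = None
    best_score = float("inf")
    for beta in candidates:
        try:
            rates = arithmetic_sparsities(num_layers, avg_sparsity, beta)
        except ValueError:
            continue  # skip candidates that give invalid sparsity rates
        score = evaluate(rates)
        if score < best_score:
            best_beta, best_score = beta, score
    if best_beta is None:
        raise ValueError("no valid candidate for beta")
    return best_beta, best_score
```

Because only the single scalar `beta` is searched, the number of trials equals the number of candidate values, regardless of how many layers the model has.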
