Evaluating Zero-Shot Long-Context LLM Compression

Chenyu Wang; Yihan Wang; Kai Li

TL;DR

This work evaluates zero-shot compression of LLMs in long-context settings, using LLaMA-2-7B-32K to benchmark pruning (magnitude pruning and Wanda) and weight-only and weight-activation quantization. The theoretical analysis predicts that the noise introduced by compression accumulates into a growing computational error as the context lengthens. Empirically, pruning remains robust to context length, while aggressive quantization degrades performance. A key finding is that quantizing only about $2\%$ of high-magnitude weight groups to $8$-bit, and the rest to $3$–$4$ bits, can mitigate the long-context degradation. The proposed hypothesis links weight sensitivity to long-range dependencies and suggests targeted quantization strategies to preserve accuracy in extended contexts. These insights could enable more efficient deployment of LLMs with very long input windows without prohibitive accuracy losses.
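The selective-precision idea above can be illustrated with a minimal sketch. It is not the authors' implementation: the group size, the symmetric round-to-nearest quantizer, the use of the largest absolute weight as the ranking criterion, and the helper names `quantize_group` and `mixed_precision_quantize` are all assumptions made for illustration, and the weights are synthetic.

```python
import random

def quantize_group(group, bits):
    """Symmetric round-to-nearest quantization of one weight group (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in group) / qmax
    if scale == 0.0:
        return list(group)  # all-zero group: nothing to quantize
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in group]

def mixed_precision_quantize(weights, group_size=8, low_bits=3,
                             high_bits=8, keep_frac=0.02):
    """Quantize the top `keep_frac` of groups (ranked by largest absolute
    weight, an assumed sensitivity proxy) to `high_bits`, the rest to `low_bits`."""
    groups = [weights[i:i + group_size]
              for i in range(0, len(weights), group_size)]
    order = sorted(range(len(groups)),
                   key=lambda g: max(abs(w) for w in groups[g]),
                   reverse=True)
    high = set(order[:round(keep_frac * len(groups))])
    out = []
    for g, group in enumerate(groups):
        out.extend(quantize_group(group, high_bits if g in high else low_bits))
    return out, high

def mse(a, b):
    """Mean squared error between two equally long weight lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Synthetic weights with a few injected high-magnitude outliers (toy data only).
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]
for i in (5, 1000, 3000):
    weights[i] = 0.5

mixed, high_groups = mixed_precision_quantize(weights)                  # ~2% at 8-bit
uniform_low, _ = mixed_precision_quantize(weights, keep_frac=0.0)       # all at 3-bit

print("groups kept at 8-bit:", len(high_groups))
print("mixed-precision MSE < uniform 3-bit MSE:",
      mse(weights, mixed) < mse(weights, uniform_low))
```

The sketch only compares weight reconstruction error on toy data. It says nothing about model accuracy or context-length sensitivity, which the paper measures on LLaMA-2-7B-32K.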

Abstract

This study evaluates the effectiveness of zero-shot compression techniques on large language models (LLMs) in long-context settings. We find that computational errors tend to increase with context length when certain compression methods are employed. We propose a hypothesis to explain why different LLM compression techniques behave differently, and we explore remedies to mitigate the performance decline that some techniques show on long contexts. This is a course report for COS 598D Machine Learning and Systems by Prof. Kai Li at Princeton University. Due to limited computational resources, our experiments were conducted only on LLaMA-2-7B-32K.

Paper Structure (13 sections, 4 equations, 5 figures)

Introduction
Related Works
Long-Context LLMs
Quantization in LLMs
Pruning in LLMs
Concurrent Work
Evaluation Details
Theoretical Analysis
Empirical Evaluation
Hypothesis on the Varied Behaviors
Conclusion
Future Work
Acknowledgements

Figures (5)

Figure 1: Pruning algorithms are robust to context length. The KL divergence of output logits between the uncompressed model and the pruned models changes little across context lengths. The pruning ratio affects only the variance of the KL divergence values measured at different context lengths.
Figure 2: Only about 2% of the weights are sensitive to low-bit quantization. Using 8-bit instead of 3/4-bit quantization for these weights makes the compressed models less sensitive to context length.
Figure 3: With low-bit ($\le 4$) weight quantization, the performance of compressed models becomes more sensitive to context length: the output of the compressed models diverges further from that of the uncompressed model as the context length increases.
Figure 4: With 3-bit quantization, the output of the compressed models clearly diverges further from that of the uncompressed model as the context length increases.
Figure 5: Randomly pruning only 10% of the weights yields an almost linearly increasing KL divergence as the context length increases.
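
The figures above all report a KL divergence between the output logits of the uncompressed and compressed models at different context lengths. A minimal sketch of that kind of metric follows. The captions do not state the direction of the divergence or how positions are mapped to context lengths, so the choice of KL(uncompressed || compressed), the convention of reading the logits at position n - 1 for a context of n tokens, and the helper names are assumptions. The logits are made-up toy values, not results.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_divergence(ref_logits, cmp_logits):
    """KL(P_ref || P_cmp) for one token position (direction is an assumption)."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(cmp_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def kl_at_context_lengths(ref_seq, cmp_seq, context_lengths):
    """Map each context length n to the KL at position n - 1 (assumed convention)."""
    return {n: kl_divergence(ref_seq[n - 1], cmp_seq[n - 1])
            for n in context_lengths}

# Toy per-position logits: the "compressed" logits drift further from the
# reference at later positions, purely to exercise the functions.
ref_seq = [[2.0, 0.5, -1.0] for _ in range(4)]
cmp_seq = [[2.0 + 0.1 * t, 0.5, -1.0] for t in range(4)]

kl_by_length = kl_at_context_lengths(ref_seq, cmp_seq, [1, 2, 4])
for n, kl in kl_by_length.items():
    print(f"context length {n}: KL = {kl:.6f}")
```

In practice the two logit sequences would come from running the uncompressed and compressed models on the same input tokens, and the per-position values would typically be averaged over many sequences before plotting.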
