Evaluating Zero-Shot Long-Context LLM Compression

Chenyu Wang, Yihan Wang, Kai Li

TL;DR

This work examines zero-shot compression of LLMs in long-context settings, using LLaMA-2-7B-32K to benchmark pruning (magnitude pruning and Wanda) and both weight-only and weight-activation quantization. A theoretical analysis predicts that the noise introduced by compression causes computational error to accumulate as the context grows. Empirically, pruning remains robust to context length, while aggressive (low-bit) quantization degrades as the context lengthens. A key finding is that keeping only about $2\%$ of weight groups, those with high magnitude, at $8$-bit while quantizing the rest to $3$–$4$ bits can mitigate this long-context degradation. The proposed hypothesis links weight sensitivity to long-range dependencies and suggests targeted quantization strategies to preserve accuracy in extended contexts. These insights could enable more efficient deployment of LLMs with very long input windows without prohibitive accuracy losses.

Abstract

This study evaluates the effectiveness of zero-shot compression techniques on large language models (LLMs) in long-context settings. We find that, with certain compression methods, computational errors tend to increase as the context grows. We propose a hypothesis to explain why different LLM compression techniques behave differently, and we explore remedies to mitigate the performance decline that some techniques show on long contexts. This is a course report for COS 598D Machine Learning and Systems by Prof. Kai Li at Princeton University. Due to limited computational resources, our experiments were conducted only on LLaMA-2-7B-32K.
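One remedy of the kind mentioned above, summarized in the TL;DR, is to keep a small fraction of weight groups at 8-bit and quantize the rest to 3 or 4 bits. The following is a minimal NumPy sketch of that idea, not the authors' implementation. The group size of 128, the asymmetric round-to-nearest quantizer, the ranking of groups by maximum absolute value, and the function names are all assumptions made for illustration.

```python
import numpy as np


def quantize_group_rtn(w, bits):
    """Asymmetric round-to-nearest quantization of one 1-D group.

    Returns the dequantized values so the error can be inspected directly.
    """
    lo, hi = w.min(), w.max()
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.clip(np.round((w - lo) / scale), 0, levels)
    return q * scale + lo


def mixed_precision_quantize(weight, group_size=128, low_bits=3,
                             high_bits=8, high_frac=0.02):
    """Quantize most groups to low_bits and a small fraction to high_bits.

    Assumption: groups are ranked by their largest absolute weight, and
    weight.size is divisible by group_size.
    """
    flat = weight.reshape(-1, group_size)
    scores = np.abs(flat).max(axis=1)
    n_high = max(1, int(round(high_frac * flat.shape[0])))
    is_high = np.zeros(flat.shape[0], dtype=bool)
    is_high[np.argsort(scores)[-n_high:]] = True

    out = np.empty_like(flat)
    for i, group in enumerate(flat):
        out[i] = quantize_group_rtn(group, high_bits if is_high[i] else low_bits)
    return out.reshape(weight.shape), is_high


# Hypothetical usage on a random matrix standing in for a weight tensor.
demo_weight = np.random.default_rng(0).standard_normal((64, 256))
demo_hat, demo_mask = mixed_precision_quantize(demo_weight)
print("groups kept at 8-bit:", int(demo_mask.sum()), "of", demo_mask.size)
```

Ranking by maximum magnitude is only one plausible reading of "high-magnitude weight groups"; a sensitivity-based score could be substituted by changing how `scores` is computed.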

Paper Structure (13 sections, 4 equations, 5 figures)


Figures (5)

  • Figure 1: Pruning algorithms are robust to context length. The KL divergence of output logits between the uncompressed model and the pruned models changes little across context lengths. The pruning ratio affects only the variance of the KL divergence values measured at different context lengths.
  • Figure 2: Only about 2% of the weights are sensitive to low-bit quantization. Using 8-bit quantization instead of 3/4-bit quantization for these weights makes the compressed models less sensitive to context length.
  • Figure 3: With low-bit ($\le 4$) weight quantization, the performance of compressed models becomes more sensitive to context length: the output of the compressed models diverges further from that of the uncompressed model as the context length increases.
  • Figure 4: With 3-bit quantization, the output of the compressed models clearly diverges further from that of the uncompressed model as the context length increases.
  • Figure 5: If we randomly prune only 10% of the weights, the KL divergence increases almost linearly with context length.
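
The captions above compare models through the KL divergence of their output logits. As a rough illustration of that metric, the sketch below computes the mean per-token KL divergence between two sets of logits. The direction KL(uncompressed || compressed), the averaging over token positions, and the function names are assumptions; the captions do not specify them.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def mean_kl(ref_logits, cmp_logits):
    """Mean over positions of KL(p_ref || p_cmp).

    Both inputs have shape (num_tokens, vocab_size): ref_logits from the
    uncompressed model, cmp_logits from the compressed one.
    """
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(cmp_logits)
    kl_per_token = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl_per_token.mean())


# Hypothetical usage with synthetic logits standing in for model outputs.
rng = np.random.default_rng(0)
ref = rng.standard_normal((8, 100))
noisy = ref + 0.1 * rng.standard_normal((8, 100))
print("mean KL:", mean_kl(ref, noisy))
```

To study context-length sensitivity, this value would be computed separately for inputs of each context length and plotted against that length.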