Evaluating Zero-Shot Long-Context LLM Compression
Chenyu Wang, Yihan Wang, Kai Li
TL;DR
This work probes zero-shot compression of LLMs in long-context settings, using LLaMA-2-7B-32K to benchmark pruning (magnitude pruning and Wanda) and both weight-only and weight-activation quantization. Theoretical analysis predicts that the computational error introduced by compression noise accumulates as the context grows. Empirically, pruning remains robust to context length, while aggressive quantization degrades performance. A key finding is that quantizing only about $2\%$ of high-magnitude weight groups to $8$ bits, while quantizing the rest to $3$–$4$ bits, can mitigate long-context degradation. The proposed hypothesis links weight sensitivity to long-range dependencies and suggests targeted quantization strategies to preserve accuracy in extended contexts. These insights could enable more efficient deployment of LLMs with very long input windows without incurring prohibitive accuracy losses.
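To make the mixed-precision idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a symmetric round-to-nearest quantizer, a group size of 128, and that "high-magnitude" means ranking groups by their largest absolute weight; none of these choices are stated in the summary above, and the function names `quantize_group` and `mixed_precision_quantize` are hypothetical.

```python
import random


def quantize_group(group, bits):
    """Symmetric round-to-nearest quantization of one weight group.

    Returns the dequantized values so the error can be inspected directly.
    """
    qmax = 2 ** (bits - 1) - 1          # e.g. 127 for 8 bits, 3 for 3 bits
    max_abs = max(abs(w) for w in group)
    if max_abs == 0.0:
        return list(group)
    scale = max_abs / qmax
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in group]


def mixed_precision_quantize(weights, group_size=128, high_frac=0.02,
                             low_bits=3, high_bits=8):
    """Quantize a flat weight list group by group with two bit widths.

    Assumption: a group's "magnitude" is its largest absolute weight; the
    top `high_frac` of groups by that score are kept at `high_bits`.
    """
    groups = [weights[i:i + group_size]
              for i in range(0, len(weights), group_size)]
    scores = [max(abs(w) for w in g) for g in groups]
    n_high = max(1, round(high_frac * len(groups)))
    ranked = sorted(range(len(groups)), key=lambda i: scores[i], reverse=True)
    high_idx = set(ranked[:n_high])

    dequantized, bits_used = [], []
    for i, g in enumerate(groups):
        bits = high_bits if i in high_idx else low_bits
        dequantized.extend(quantize_group(g, bits))
        bits_used.append(bits)
    return dequantized, bits_used


if __name__ == "__main__":
    random.seed(0)
    # Synthetic weights standing in for one flattened weight matrix.
    weights = [random.gauss(0.0, 0.02) for _ in range(128 * 100)]
    deq, bits_used = mixed_precision_quantize(weights)
    mean_err = sum(abs(a - b) for a, b in zip(weights, deq)) / len(weights)
    print("groups at 8 bits:", bits_used.count(8), "of", len(bits_used))
    print("mean absolute quantization error:", mean_err)
```

Changing `high_frac` or `low_bits` in this sketch gives a rough feel for the trade-off between storage and per-weight error. It does not model how that error propagates through attention over a long context.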
Abstract
This study evaluates the effectiveness of zero-shot compression techniques on large language models (LLMs) in long-context settings. We find that, with certain compression methods, computational errors tend to increase as the context grows. We propose a hypothesis to explain why different LLM compression techniques behave differently, and we explore remedies to mitigate the performance decline that some techniques show on long contexts. This is a course report for COS 598D Machine Learning and Systems by Prof. Kai Li at Princeton University. Due to limited computational resources, our experiments were conducted only on LLaMA-2-7B-32K.
