Evaluating the Impact of Compression Techniques on Task-Specific Performance of Large Language Models

Bishwash Khanal; Jeffery M. Capone

TL;DR

This study evaluates the impact of three popular compression methods (Magnitude Pruning, SparseGPT, and Wanda) on the LLaMA-2-7B model, focusing on the trade-offs between model size reduction, downstream task performance, and the role of calibration data.

Abstract

Large language models (LLMs) offer powerful capabilities but incur substantial computational costs, driving the need for efficient compression techniques. This study evaluates the impact of three popular compression methods (Magnitude Pruning, SparseGPT, and Wanda) on the LLaMA-2-7B model, focusing on the trade-offs between model size reduction, downstream task performance, and the role of calibration data. Our findings reveal that while SparseGPT and Wanda preserve perplexity even at 50% sparsity, they suffer significant degradation on downstream tasks, highlighting the inadequacy of perplexity as the sole evaluation metric. To address this, we introduce Jensen-Shannon (JS) Divergence as a more comprehensive metric that captures nuanced changes in model behavior post-compression. We further demonstrate that task-specific calibration data significantly enhances the downstream performance of compressed models compared to general calibration data. This research underscores the necessity for diverse evaluation metrics and careful calibration data selection to fully understand the complexities of LLM compression and its implications for practical applications.
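
To make the proposed metric concrete, the sketch below computes the standard Jensen-Shannon divergence between two discrete probability distributions, such as the next-token distributions of an original and a compressed model at one position. It assumes the textbook definition with base-2 logarithms (so the value lies in [0, 1]). The paper's exact formulation is not shown in this section, and the function names and toy distributions here are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms where p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions.

    Assumes p and q have the same length and each sums to 1.
    The mixture m is positive wherever p or q is positive, so the
    KL terms below never divide by zero.
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical next-token distributions over a 4-token vocabulary.
original_probs = [0.70, 0.20, 0.05, 0.05]
compressed_probs = [0.40, 0.30, 0.20, 0.10]

score = js_divergence(original_probs, compressed_probs)
print(f"JS divergence: {score:.4f}")
```

Unlike perplexity, which scores only the probability assigned to the reference token, this quantity compares the full output distributions, and it is symmetric and bounded. In practice one would likely average it over positions and samples, which is an assumption of this sketch rather than a detail taken from the paper.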

Paper Structure (13 sections, 4 equations, 4 figures, 4 tables)

Introduction
Related Works
JS Divergence as an Evaluation Metric
Evaluation
General Performance
Downstream Task Performance
Jensen-Shannon Divergence (JS) Evaluation
Impact of Calibration Data on LLM Compression
Discussion
Conclusion
Downstream Task Metrics
GPT-4o Evaluation
GPT-4 vs GPT-4o

Figures (4)

Figure 1: JS Divergence evaluated on compressed models against general and downstream task metrics.
Figure 2: Template used for evaluating the quality of responses generated by the compressed models with GPT-4. It includes a system prompt, user prompt, instruction-input pair, ideal response, and generated response (Wanda at 30% sparsity), providing a structured approach for assessing accuracy, completeness, and relevance.
Figure 3: GPT-4 Evaluation compared with JS Divergence and Perplexity on compressed models.
Figure 4: GPT-4 vs GPT-4o evaluation compared with JS Divergence on compressed models.
