Beyond Perplexity: Multi-dimensional Safety Evaluation of LLM Compression

Zhichao Xu; Ashim Gupta; Tao Li; Oliver Bentham; Vivek Srikumar

TL;DR

This work examines the safety of locally deployed compressed LLMs along four evaluation dimensions that go beyond perplexity: degeneration harm, representational harm, dialect bias, and language modeling and downstream task performance. It systematically compares pruning (unstructured and semi-structured) and quantization across two model families (Llama-2 and Tülu-2) at multiple sizes. Compression can reduce generation toxicity while sometimes increasing discrimination-related biases, and its effects diverge across protected groups and dialects. Quantization tends to preserve safety and performance better than pruning at comparable compression ratios. Supervised fine-tuning (SFT) can reduce degeneration harm but not representational harm, and the order in which pruning and fine-tuning are applied matters for downstream tasks and biases. The paper advocates integrating fine-grained safety evaluations into compression workflows to ensure reliable, equitable behavior in real-world deployments.

Abstract

Increasingly, model compression techniques enable large language models (LLMs) to be deployed in real-world applications. As a result of this momentum towards local deployment, compressed LLMs will interact with a large population. Prior work on compression typically prioritizes preserving perplexity, which is directly analogous to training loss. The impact of the compression method on other critical aspects of model behavior, particularly safety, requires systematic assessment. To this end, we investigate the impact of model compression along four dimensions: (1) degeneration harm, i.e., bias and toxicity in generation; (2) representational harm, i.e., biases in discriminative tasks; (3) dialect bias; and (4) language modeling and downstream task performance. We examine a wide spectrum of LLM compression techniques, including unstructured pruning, semi-structured pruning, and quantization. Our analysis reveals that compression can lead to unexpected consequences. Although compression may unintentionally alleviate LLMs' degeneration harm, it can still exacerbate representational harm. Furthermore, increasing compression produces a divergent impact on different protected groups. Finally, different compression methods have drastically different safety impacts: for example, quantization mostly preserves bias while pruning degrades quickly. Our findings underscore the importance of integrating safety assessments into the development of compressed LLMs to ensure their reliability across real-world applications. Our implementation and results are available at https://github.com/zhichaoxu-shufe/Beyond-Perplexity-Compression-Safety-Eval
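
The abstract's remark that perplexity is directly analogous to training loss can be made concrete: perplexity is the exponential of the average per-token negative log-likelihood, i.e., of the cross-entropy loss. The following is a minimal sketch of that relationship, not the authors' evaluation code; the function name and the sample log-probabilities are hypothetical and chosen only for illustration.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) for natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # The mean negative log-likelihood is the usual cross-entropy training loss.
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical values: a model that assigns probability 0.25 to every observed token.
uniform_over_four = [math.log(0.25)] * 8
# Such a model is as uncertain as a uniform guess over four tokens,
# so its perplexity is approximately 4.
print(perplexity(uniform_over_four))
```

Because the mapping from loss to perplexity is monotonic, a compression method tuned to preserve perplexity is effectively tuned to preserve the language modeling loss, which says nothing directly about the safety dimensions listed in the abstract.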

Paper Structure (38 sections, 2 equations, 7 figures, 33 tables)

Introduction
Background
Potential Harms by LLMs
Compression Methods for LLMs
Prior Works on LLM Compression Evaluation
Evaluating Compression Models
Compression Algorithms and Ratios
Safety Evaluation Dimensions
Performance Evaluation Dimensions
Degeneration Harm & Representational Harms
How Does Compression Affect Different Protected Groups?
How Does Compression Affect Different Dialects of English?
The Impact of Supervised Fine-tuning
Conclusions and Recommendations
Details of Datasets and Corresponding Evaluations
...and 23 more sections

Figures (7)

Figure 1: Llama-2-13B's compression results on different datasets. The x-axis shows the compression ratio. LLM.int8(), AWQ, and GPTQ correspond to compression ratios of 50%, 75%, and 75%, respectively. 7B models show similar trends (\ref{fig:llama2_7b_comprehensive}).
Figure 2: Change in representational bias ($\downarrow$) against different groups as the compression ratio increases, with 13B models. Although the aggregated bias metrics are relatively stable, different protected groups behave very differently. Results with 7B models show similar trends (\ref{fig:intra_group_heatmap_appendix}).
Figure 3: Llama-2-13B perplexity ($\downarrow$) evaluation results for dialect bias. AWQ and GPTQ produce nearly identical results, so their markers overlap in the plots. Llama-2-7B shows similar trends (\ref{fig:perplexity_figure_appendix}).
Figure 4: Bias (left) and accuracy (right) results on the BBQ dataset for SFT$\rightarrow$Prune versus Prune$\rightarrow$SFT.
Figure 5: Llama-2-7B's compression results on different datasets. The x-axis shows the compression ratio. LLM.int8(), AWQ, and GPTQ correspond to compression ratios of 50%, 75%, and 75%, respectively.
...and 2 more figures
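
The compression ratios quoted in the captions for Figures 1 and 5 are consistent with simple bit-width arithmetic. The sketch below assumes a 16-bit baseline, 8-bit weights for LLM.int8(), and 4-bit weights for AWQ and GPTQ; these bit widths are assumptions rather than values stated on this page, and the calculation ignores overhead such as quantization scales or outlier values kept at higher precision. The helper names are hypothetical.

```python
def compression_ratio(bits_compressed, bits_baseline=16):
    """Fraction of weight storage removed relative to the baseline precision."""
    return 1.0 - bits_compressed / bits_baseline


def semi_structured_sparsity(n_zeroed, group_size):
    """Sparsity of an N:M pattern that zeroes n_zeroed weights in every group_size weights."""
    return n_zeroed / group_size


# Assumed bit widths (not stated in the captions): 8-bit and 4-bit weights.
print(compression_ratio(8))            # 0.5, matching the 50% quoted for LLM.int8()
print(compression_ratio(4))            # 0.75, matching the 75% quoted for AWQ and GPTQ
# Illustrative 2:4 semi-structured pruning pattern: two of every four weights are zeroed.
print(semi_structured_sparsity(2, 4))  # 0.5
```

Placing pruning sparsity and quantization bit width on a single compression-ratio axis in this way is what allows the methods to be compared at comparable ratios.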
