Compressing LLMs: The Truth is Rarely Pure and Never Simple

Ajay Jaiswal, Zhe Gan, Xianzhi Du, Bowen Zhang, Zhangyang Wang, Yinfei Yang

TL;DR

Modern LLMs incur prohibitive compute and memory costs, which motivates pruning and quantization as cost-reduction strategies. Perplexity alone is insufficient to assess compressed LLMs, so the authors introduce the Knowledge-Intensive Compressed LLM Benchmark (LLM-KICK), which evaluates knowledge access, knowledge augmentation, and instruction following in addition to perplexity. The study finds that pruning often degrades performance on knowledge-intensive tasks and struggles with structured N:M sparsity, while quantization generally preserves or improves performance in several settings. In-context retrieval and summarization remain robust to compression when augmented knowledge is provided. These results argue for richer evaluation protocols and point to directions that combine compression with calibration or parameter-efficient fine-tuning to better preserve true capabilities.

Abstract

Despite their remarkable achievements, modern Large Language Models (LLMs) face exorbitant computational and memory footprints. Recently, several works have shown significant success in training-free and data-free compression (pruning and quantization) of LLMs, achieving 50-60% sparsity and reducing the bit width to 3 or 4 bits per weight with negligible degradation of perplexity relative to the uncompressed baseline. As recent research efforts focus on developing increasingly sophisticated compression methods, our work takes a step back and re-evaluates the effectiveness of existing SoTA compression methods, which rely on a fairly simple and widely questioned metric, perplexity (questioned even for dense LLMs). We introduce the Knowledge-Intensive Compressed LLM BenchmarK (LLM-KICK), a collection of carefully curated tasks that redefines the evaluation protocol for compressed LLMs, which have significant alignment with their dense counterparts and for which perplexity fails to capture subtle changes in their true capabilities. LLM-KICK unveils many favorable merits and unfortunate plights of current SoTA compression methods: all pruning methods suffer significant performance degradation, sometimes at trivial sparsity ratios (e.g., 25-30%), and fail for N:M sparsity on knowledge-intensive tasks; current quantization methods are more successful than pruning; yet pruned LLMs, even at $\geq 50\%$ sparsity, are robust in-context retrieval and summarization systems; among others. LLM-KICK is designed to holistically assess compressed LLMs' ability for language understanding, reasoning, generation, in-context retrieval, in-context summarization, etc. We hope our study can foster the development of better LLM compression methods. The code is available at https://github.com/VITA-Group/llm-kick.
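
To make the two pruning regimes named in the abstract concrete, the following is a minimal NumPy sketch that contrasts unstructured sparsity with structured N:M sparsity. It is not the benchmark's code: it uses plain weight magnitude as the importance score (the simple baseline, not the Wanda or SparseGPT criteria), the function names `magnitude_prune_unstructured` and `magnitude_prune_n_m` are hypothetical, and it assumes the convention that N:M means at most N non-zero weights in every group of M consecutive weights along a row.

```python
import numpy as np


def magnitude_prune_unstructured(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero the `sparsity` fraction of entries with the smallest |w|,
    chosen over the whole matrix (no structural constraint)."""
    k = int(round(sparsity * w.size))
    if k == 0:
        return w.copy()
    flat = np.abs(w).ravel()
    # Indices of the k smallest-magnitude entries (order among them is irrelevant).
    drop = np.argpartition(flat, k - 1)[:k]
    mask = np.ones(w.size, dtype=bool)
    mask[drop] = False
    return w * mask.reshape(w.shape)


def magnitude_prune_n_m(w: np.ndarray, n: int, m: int) -> np.ndarray:
    """Assumed convention: keep the n largest-magnitude entries in every
    group of m consecutive weights along each row, and zero the rest."""
    rows, cols = w.shape
    if cols % m != 0:
        raise ValueError("number of columns must be divisible by m")
    groups = w.reshape(rows, cols // m, m)
    # Ascending sort by magnitude: the first m - n positions are the smallest.
    order = np.argsort(np.abs(groups), axis=-1)
    mask = np.ones(groups.shape, dtype=bool)
    np.put_along_axis(mask, order[..., : m - n], False, axis=-1)
    return (groups * mask).reshape(rows, cols)


# Toy weight row: the first group of four holds all the large weights.
w_demo = np.array([[8.0, 7.0, 6.0, 5.0, 0.1, 0.2, 0.3, 0.4]])
unstructured = magnitude_prune_unstructured(w_demo, 0.5)
structured = magnitude_prune_n_m(w_demo, 2, 4)
```

On this toy row both results are 50% sparse, but the unstructured mask keeps 8, 7, 6, 5, while the 2:4 mask must drop 6 and 5 and keep 0.3 and 0.4 instead. That per-group constraint is one plausible intuition for why structured N:M sparsity is the harder setting.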

Paper Structure (46 sections, 17 figures, 3 tables)

Figures (17)

  • Figure 1: True Merits of SoTA Compression. Top row shows the marginal increase in perplexity under SoTA compression methods when compared with simple magnitude-based pruning. Bottom row shows the failure of compressed Vicuna-7B [vicuna2023] (via Magnitude, Wanda, SparseGPT, GPTQ) to respond correctly to knowledge-intensive factoid-based questions.
  • Figure 2: Compressed LLMs for Factoid-based QA. Performance comparison of compressed LLMs on the Factoid-QA task using FreebaseQA [Jiang2019FreebaseQAAN]. Results (average across 3 independent runs) presented are for structured (N:M sparsity), unstructured sparsity, and quantization.
  • Figure 3: Compressed LLMs for Multiple-Choice Reasoning based QA. Performance comparison of compressed LLMs on MCR-QA tasks using the MMLU benchmark [hendrycks2020measuring]. Results (average across 3 independent runs) presented are for structured (N:M sparsity), unstructured sparsity, and quantization.
  • Figure 4: Compressed LLMs for In-context Retrieval Augmented QA. Performance comparison of compressed LLMs on the ICRA-QA task. We present a head-to-head comparison of closed-book evaluation (no external knowledge is augmented in-context) with open-book evaluation (external knowledge is augmented in-context). Results (average across 3 independent runs) presented are for structured N:M sparsity, unstructured sparsity, and quantization.
  • Figure 5: Compressed LLMs for In-Context Summarization. Performance comparison of compressed Vicuna-7B for in-context summarization of small, medium, and large stories while preserving coherence, consistency, fluency, and relevance. Results (average across 3 independent runs) presented are for structured (2:4 sparsity - Row 3), unstructured sparsity, and quantization.
  • ...and 12 more figures