Do Compressed LLMs Forget Knowledge? An Experimental Study with Practical Implications

Duc N. M Hoang, Minsik Cho, Thomas Merth, Mohammad Rastegari, Zhangyang Wang

TL;DR

The paper investigates whether compression erases LLM knowledge or merely displaces it. It formalizes two hypotheses, knowledge forgetting and knowledge displacement, and tests them with extensive experiments comparing prompting against parameter-efficient re-training such as LoRA. It introduces Inference-time Dynamic Prompting (IDP), a lightweight method that selects among prompts at inference time without added inference overhead, and shows that it recovers post-compression performance as well as or better than LoRA while cutting the extra parameter size by 21x and inference latency by 60%. Analyses of attention and activation patterns show that prompted and re-trained models recover performance in two different regimes, which supports the knowledge displacement view. In practice, this allows knowledge recovery in deployed compressed LLMs with minimal compute and storage cost.

Abstract

Compressing Large Language Models (LLMs) often leads to reduced performance, especially for knowledge-intensive tasks. In this work, we dive into how compression damages LLMs' inherent knowledge and the possible remedies. We start by proposing two conjectures on the nature of the damage: one is that certain knowledge is forgotten (or erased) after LLM compression, hence requiring the compressed model to (re)learn from data with additional parameters; the other presumes that knowledge is internally displaced, so that merely "inference re-direction" with input-side augmentation such as prompting is required to recover the knowledge-related performance. Extensive experiments are then designed to (in)validate the two conjectures. We observe the promise of prompting in comparison to model tuning; we further unlock prompting's potential by introducing a variant called Inference-time Dynamic Prompting (IDP), which can effectively increase prompt diversity without incurring any inference overhead. Our experiments consistently suggest that, compared to classical re-training alternatives such as LoRA, prompting with IDP leads to better or comparable post-compression performance recovery, while reducing the extra parameter size by 21x and inference latency by 60%. Our experiments hence strongly endorse the conjecture of "knowledge displaced" over "knowledge forgotten", and shed light on a new efficient mechanism to restore compressed LLM performance. We additionally visualize and analyze the different attention and activation patterns between prompted and re-trained models, demonstrating that they achieve performance recovery in two different regimes.
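
The abstract and the Figure 3 caption say only that IDP selects among prompts at inference time using the existing attention matrix. The sketch below is one hypothetical reading of that idea, not the authors' implementation: several candidate prompts occupy separate key spans, the prompt that receives the highest mean attention from the input tokens is kept, and the other prompts are masked out before the attention weights are renormalized. The function names, the mean-attention selection rule, and the span layout are all assumptions made for illustration, and the sketch makes no attempt to reproduce the overhead-free design described in the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax; -inf logits get exactly zero weight."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def select_prompt(scores, prompt_spans):
    """Pick one candidate prompt from attention logits (hypothetical rule).

    scores: (n_query, n_key) pre-softmax attention logits, where the
        queries are the input tokens.
    prompt_spans: list of (start, end) key-index ranges, one per
        candidate prompt (assumed layout, end exclusive).
    Returns the index of the kept prompt and the renormalized weights.
    """
    attn = softmax(scores)
    # Assumed selection rule: mean attention per key position in each span,
    # so longer prompts are not favored just for having more tokens.
    mass = [attn[:, start:end].mean() for start, end in prompt_spans]
    best = int(np.argmax(mass))
    masked = scores.astype(np.float64)  # astype returns a copy
    for i, (start, end) in enumerate(prompt_spans):
        if i != best:
            masked[:, start:end] = -np.inf  # drop the unselected prompts
    return best, softmax(masked)

# Toy example: 2 input queries, 7 keys (prompt A: 0-1, prompt B: 2-4, input: 5-6).
scores = np.zeros((2, 7))
scores[:, 2:5] = 2.0  # the input tokens attend more strongly to prompt B
spans = [(0, 2), (2, 5)]
best, weights = select_prompt(scores, spans)
print("selected prompt:", best)
print(weights.round(3))
```

In this toy setup the masked prompt receives zero weight and each row of the returned matrix still sums to one, so the downstream weighted sum over values is unchanged in form.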

Paper Structure

This paper contains 24 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: This figure compares the performance of models compressed with GPTQ (quantization) and SparseGPT (pruning). The models were compressed using either the C4 or the Wikitext dataset. Their average performance is shown across nine tasks, each representing a different knowledge domain.
  • Figure 2: Using a 3-bit quantized Llama-7b model fine-tuned on the C4 dataset, we contrast the average accuracy across nine tasks with the model's word perplexity across various prompt lengths. A longer sequence length improves perplexity but does not always yield better performance.
  • Figure 3: This figure underscores the key advantage of inference-time dynamic prompting (IDP): its minimalistic yet effective design. By making straightforward alterations to the existing weighted sum operation and using the existing attention matrix for prompt selection, IDP accomplishes its objectives without incurring any additional parameter costs.
  • Figure 4: GPTQ Llama-7b/OPT-6.7b average accuracy across nine tasks vs. number of trainable parameters. IDP shows remarkable efficiency and performance compared to parameter-intensive methods like LoRA.
  • Figure 5: Cosine similarity between the self-attention and token activations at each layer and those of an uncompressed baseline, for different fine-tuning techniques. A higher cosine score means the layer is closer to the baseline.
  • ...and 2 more figures
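
The layer-wise comparison described in the Figure 5 caption can be sketched as follows. This is a minimal illustration, assuming that each layer's self-attention map or token activations are available as arrays and are flattened before comparison; the function name `layerwise_cosine` and the random stand-in data are hypothetical and do not come from the paper.

```python
import numpy as np

def layerwise_cosine(baseline_layers, candidate_layers, eps=1e-12):
    """Cosine similarity per layer between a baseline and a candidate model.

    Each argument is a sequence of arrays, one per layer (for example an
    attention map or a token-activation matrix). Arrays are flattened, so
    corresponding layers must have the same number of elements.
    """
    sims = []
    for base, cand in zip(baseline_layers, candidate_layers):
        b = np.asarray(base, dtype=np.float64).ravel()
        c = np.asarray(cand, dtype=np.float64).ravel()
        denom = np.linalg.norm(b) * np.linalg.norm(c)
        sims.append(float(b @ c / max(denom, eps)))  # eps guards zero vectors
    return sims

# Stand-in data: 4 layers of (tokens x hidden) activations.
rng = np.random.default_rng(0)
baseline = [rng.normal(size=(8, 16)) for _ in range(4)]
# A "recovered" model modeled here as the baseline plus small noise.
recovered = [layer + 0.1 * rng.normal(size=layer.shape) for layer in baseline]

for idx, sim in enumerate(layerwise_cosine(baseline, recovered)):
    print(f"layer {idx}: cosine similarity = {sim:.3f}")
```

A score near 1 for a layer indicates that the tuned model's attention or activations point in almost the same direction as the uncompressed baseline at that layer.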