Through a Compressed Lens: Investigating The Impact of Quantization on Factual Knowledge Recall

Qianli Wang, Mingyang Wang, Nils Feldhus, Simon Ostermann, Yuan Cao, Hinrich Schütze, Sebastian Möller, Vera Schmitt

TL;DR

This study investigates how post-training weight quantization affects factual knowledge recall (FKR) in large language models, using two interpretability-driven tasks—knowledge memorization and latent multi-hop reasoning—to examine internal storage/retrieval and reasoning pathways. It compares full-precision models to quantized variants across three PTQ methods and multiple bit-widths, on datasets probing one-hop and two-hop factual knowledge. The findings show that quantization generally degrades FKR, especially in smaller models, though some configurations preserve or even improve recall, with BitSandBytes providing the strongest preservation. The work provides neuron- and layer-level analyses to trace information loss to late network stages and demonstrates that while quantization is a viable compression strategy with modest FKR impact, generalization to multilingual settings and larger architectures remains an avenue for future work.

Abstract

Quantization methods are widely used to accelerate inference and streamline the deployment of large language models (LLMs). Although quantization's effects on various LLM capabilities have been extensively studied, one critical area remains underexplored: factual knowledge recall (FKR), the process by which LLMs access stored knowledge. To this end, we conduct comprehensive experiments using three common quantization techniques at distinct bit widths, in conjunction with interpretability-driven analyses on two tasks, knowledge memorization and latent multi-hop reasoning. We show that quantization typically results in information loss within LLMs, consequently diminishing their capacity for FKR. This effect is particularly amplified in smaller models within the same architectural families. However, models quantized at reduced bit precision do not consistently exhibit inferior performance, and quantization occasionally even enhances model FKR. We find that BitSandBytes demonstrates the highest preservation of the original full-precision model's FKR. Despite variability across models and methods, quantization causes only modest performance degradation and remains an effective compression strategy.
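To make the notion of information loss at reduced bit widths concrete, the following is a minimal sketch of symmetric round-to-nearest (absmax) weight quantization on toy data. It is an illustration only and is not one of the three quantization techniques evaluated in the paper; the function names, the Gaussian toy weights, and the chosen bit widths are all assumptions introduced here.

```python
import random


def quantize_absmax(weights, bits):
    """Map floats to signed integer codes using one scale per tensor (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1  # largest code, e.g. 127 for 8 bits, 7 for 4 bits
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0.0:
        # All-zero input: every code is zero and any scale works.
        return [0] * len(weights), 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]


def mean_abs_error(weights, bits):
    """Average absolute difference between original and dequantized weights."""
    codes, scale = quantize_absmax(weights, bits)
    restored = dequantize(codes, scale)
    return sum(abs(w - r) for w, r in zip(weights, restored)) / len(weights)


# Hypothetical toy "weight tensor": small Gaussian values, fixed seed for repeatability.
random.seed(0)
toy_weights = [random.gauss(0.0, 0.02) for _ in range(4096)]

errors = {bits: mean_abs_error(toy_weights, bits) for bits in (8, 4, 3)}
for bits, err in errors.items():
    print(f"{bits}-bit mean abs error: {err:.6f}")
```

In this toy setting the reconstruction error grows as the bit width shrinks, which is the kind of weight-level information loss the abstract refers to. How such loss translates into changes in FKR depends on the model and method, which is what the paper investigates.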

Paper Structure

This paper contains 26 sections, 28 figures, 5 tables.

Figures (28)

  • Figure 1: The effect of quantization on factual knowledge recall through knowledge memorization analysis and latent multi-hop reasoning analysis.
  • Figure 2: Top neuron distribution (Qwen2.5-7B)
  • Figure 3: Top neuron distribution (Llama3-8B)
  • Figure 4: Attention (Qwen2.5-7B): Landmark on continent
  • Figure 5: FFN (Qwen2.5-7B): Landmark on continent
  • ...and 23 more figures