Intrinsic Structure as a Proxy for Saliency: SVD-Based Weight Preservation for Mixed-Precision Quantization in Large Language Models

Shashank Landge, Abhishek Patil, Tejas Kamble, Bhushan Buddhivant, Priyanka Joshi

TL;DR

This work tackles the challenge of efficiently quantizing large language models without calibration data by proposing a data-free, structure-aware approach. It argues that weights aligned with a model's principal structure, as revealed by SVD, are intrinsically important for downstream performance and should be preserved in FP32 while the rest are quantized to 4-bit. Through experiments on GLUE tasks with DistilBERT, the SVD-based method matches or exceeds activation- and Hessian-based saliency methods, with notable gains on RTE and QNLI, and shows strong overlap with Hessian-based selections. The findings demonstrate that intrinsic weight structure can serve as a robust proxy for saliency, enabling secure, privacy-preserving model compression without forward passes or calibration data.

Abstract

As Large Language Models (LLMs) continue to scale in parameter count, deploying them on commodity hardware has become increasingly challenging. Post-Training Quantization (PTQ) addresses this by reducing the precision of model weights, typically to 4-bit or lower. However, uniform quantization often leads to significant performance degradation due to "outlier features": a small number of weights that are critical for maintaining model accuracy. Current state-of-the-art methods such as AWQ (Activation-aware Weight Quantization) and SpQR (Sparse-Quantized Representation) rely on calibration data to identify these salient weights via activation magnitudes or Hessian sensitivity. When data privacy is paramount or calibration data is unavailable, these methods are inapplicable. In this work, we propose a data-free, structure-aware hypothesis: the weights aligned with the principal components found by Singular Value Decomposition (SVD) are intrinsically important to the model's downstream performance. We introduce a selection heuristic that preserves the top-$k$ weights aligned with the principal components in FP32 while aggressively quantizing the residual weights. We compare our method against activation-aware (AWQ) and second-order (SpQR) methods on three GLUE benchmarks (MRPC, RTE, QNLI) using a DistilBERT backbone. Our experiments reveal that structural importance is highly correlated with functional importance. On the challenging RTE task, our SVD-based method achieves an accuracy of 66.06%, outperforming both AWQ (65.34%) and SpQR (65.34%) at high protection budgets, validating that intrinsic matrix structure can serve as a robust proxy for weight saliency without the need for forward passes or calibration data.
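
The selection heuristic described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how alignment with the principal components is scored, so the sketch assumes the score of each weight is its magnitude in a rank-`rank` SVD reconstruction of the matrix. The function names (`svd_saliency_mask`, `quantize_4bit`, `mixed_precision`), the `rank` parameter, and the symmetric per-tensor round-to-nearest quantizer are all hypothetical choices made for illustration.

```python
import numpy as np

def svd_saliency_mask(W, rank, k):
    """Boolean mask of the k entries with the largest magnitude in the
    rank-`rank` SVD reconstruction of W (assumed scoring rule)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    # Low-rank reconstruction from the leading singular triplets.
    W_low = (U[:, :rank] * S[:rank]) @ Vt[:rank, :]
    scores = np.abs(W_low).ravel()
    # Indices of the k largest scores (order among them is irrelevant).
    top = np.argpartition(scores, -k)[-k:]
    mask = np.zeros(W.size, dtype=bool)
    mask[top] = True
    return mask.reshape(W.shape)

def quantize_4bit(W):
    """Symmetric per-tensor round-to-nearest 4-bit quantization.
    Returns the dequantized values (assumed quantizer)."""
    max_abs = np.abs(W).max()
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(W / scale), -8, 7)  # signed 4-bit integer range
    return q * scale

def mixed_precision(W, rank=8, k=256):
    """Keep the k selected weights at full precision, quantize the rest."""
    mask = svd_saliency_mask(W, rank, k)
    # Zero the protected entries so they do not influence the scale.
    residual = np.where(mask, 0.0, W)
    W_hat = quantize_4bit(residual)
    W_hat[mask] = W[mask]  # restore protected weights unchanged
    return W_hat, mask

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((64, 64))
    W_hat, mask = mixed_precision(W, rank=8, k=100)
    print("protected weights:", int(mask.sum()))
    print("mean abs error:", float(np.abs(W - W_hat).mean()))
```

No forward pass or calibration batch appears anywhere in the sketch; the mask depends only on the weight matrix, which is the property the abstract emphasizes.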

Paper Structure

This paper contains 27 sections, 8 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Accuracy vs. Protection Budget (k). Comparing our SVD-based method (Blue) against AWQ (Green) and SpQR (Red). Our method consistently matches or beats the baselines without using any calibration data.
  • Figure 2: Selection Similarity (%). Intersection over Union (IoU) of the weights selected by our SVD method vs. AWQ (Teal) and SpQR (Maroon). The high overlap with SpQR confirms that SVD is a strong proxy for Hessian-based sensitivity.
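
The Intersection over Union measure named in the Figure 2 caption can be computed from two boolean selection masks as in the sketch below. The function name `selection_iou` and the convention of returning 1.0 when both masks are empty are assumptions made for illustration; the source does not describe how the overlap was computed beyond naming IoU.

```python
import numpy as np

def selection_iou(mask_a, mask_b):
    """Intersection over Union of two boolean masks of equal shape."""
    mask_a = np.asarray(mask_a, dtype=bool)
    mask_b = np.asarray(mask_b, dtype=bool)
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    # Assumed convention: two empty selections count as identical.
    return float(inter) / float(union) if union else 1.0

if __name__ == "__main__":
    a = np.array([True, True, False, False])
    b = np.array([True, False, True, False])
    # One shared position out of three selected overall.
    print(selection_iou(a, b))
```

Applied to the masks that two selection methods produce for the same layer, a value near 1 indicates that the methods protect largely the same weights.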