Alignment-Aware Quantization for LLM Safety

Sunghyun Wee, Suyoung Kim, Hyeonjin Kim, Kyomin Hwang, Nojun Kwak

TL;DR

This work addresses the risk that standard post-training quantization (PTQ) can erode RLHF-driven safety alignment in large language models. It introduces Alignment-Aware Quantization (AAQ), which incorporates an Alignment-Preserving Contrastive (APC) loss into the PTQ pipeline to actively preserve alignment while achieving aggressive 4-bit quantization ($W4A4$). The APC pull-push objective leverages a safe, fine-tuned model and an unsafe pre-trained reference to guide the quantized model, using top-$K$ filtering to stabilize training on alignment-sensitive outputs. Across diverse model families, AAQ achieves robust safety preservation with minimal sacrifice to other utilities, offering a practical, generalizable path to efficient and trustworthy quantized LLMs. This approach decouples safety from perplexity and demonstrates that alignment-aware quantization can maintain safety without specialized calibration data, enabling safer deployment at scale.

Abstract

Safety and efficiency are both important factors when deploying large language models (LLMs). LLMs are aligned with human preferences for safety, and post-training quantization (PTQ) is applied afterward for efficiency. However, these two objectives are often in conflict, revealing a fundamental flaw in the conventional PTQ paradigm: quantization can become a safety vulnerability if it aims only at low perplexity. Models can show low perplexity yet exhibit significant degradation in alignment with the safety policy, which makes perplexity alone an insufficient and often misleading proxy for model safety. To address this, we propose Alignment-Aware Quantization (AAQ), a novel approach that integrates an Alignment-Preserving Contrastive (APC) loss into the PTQ pipeline. Unlike a simple reconstruction loss, the APC loss explicitly preserves alignment by encouraging the quantized model to mimic the safe, instruction-tuned model while diverging from its unaligned, pre-trained counterpart. Our method achieves this robust safety alignment without resorting to specialized safety-focused calibration datasets, highlighting its practical utility and broad applicability. AAQ is compatible with standard PTQ techniques and enables robust 4-bit (W4A4) quantization across diverse model families such as LLaMA, Qwen, and Mistral while maintaining safety where previous methods fail. Our work resolves the critical trade-off between efficiency and safety, paving the way toward LLMs that are both efficient and trustworthy. Anonymized code is available in the supplementary material.
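To make the W4A4 setting concrete, the sketch below simulates 4-bit quantization of both weights and activations for one linear layer. It is only an illustration under stated assumptions: it uses plain symmetric round-to-nearest with one scale per weight row and one scale per activation vector, which is a generic baseline and not necessarily the quantizer used by AAQ. The function names `fake_quantize` and `linear_w4a4` are hypothetical.

```python
def fake_quantize(values, bits=4):
    """Symmetric round-to-nearest quantization, returned in dequantized form.

    Assumption: a single scale per call. With bits=4 the integer grid is
    [-8, 7], so at most 16 distinct levels survive.
    """
    qmax = 2 ** (bits - 1) - 1          # 7 for 4 bits
    qmin = -(2 ** (bits - 1))           # -8 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Note: Python's round() rounds exact halves to the nearest even integer.
    dequantized = [min(max(round(v / scale), qmin), qmax) * scale for v in values]
    return dequantized, scale


def linear_w4a4(weights, activations):
    """Linear layer y = W x with 4-bit weights (W4) and 4-bit activations (A4)."""
    x_q, _ = fake_quantize(activations)
    outputs = []
    for row in weights:
        w_q, _ = fake_quantize(row)     # one scale per output row (assumption)
        outputs.append(sum(w * x for w, x in zip(w_q, x_q)))
    return outputs


if __name__ == "__main__":
    values = [0.7, -0.35, 0.1, 0.02]
    dequantized, scale = fake_quantize(values)
    print("scale:", scale)
    print("dequantized:", dequantized)
    print("layer output:", linear_w4a4([[1.0, 0.0], [0.0, 1.0]], [0.7, -0.7]))
```

The rounding error per value is bounded by half the scale, so small values such as `0.02` collapse to zero when a single large value sets the scale. A quantizer tuned only for this reconstruction error has no term that distinguishes safety-relevant outputs from any others, which is the gap the abstract points to.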

Paper Structure

This paper contains 22 sections, 6 equations, 3 figures, 4 tables, 1 algorithm.

Figures (3)

  • Figure 1: Examples of alignment regression after quantization. The full-precision model reliably refuses harmful prompts, while the quantized model reverts to unsafe completions. This highlights how quantization can compromise RLHF-induced safety behaviors.
  • Figure 2: Overview of the standard deployment pipeline compared to our proposed Alignment-Aware Quantization (AAQ). (a) A standard LLM deployment pipeline where a fine-tuned model undergoes quantization. This process may include a transformation step optimized for reconstruction error but lacks an explicit mechanism to preserve safety, leading to a potentially unsafe quantized model. (b) Our proposed AAQ pipeline introduces an Alignment-aware Transformation step before quantization. This step uses a contrastive objective that leverages both the safe, fine-tuned model and the unsafe, pre-trained model to ensure the final quantized model is both efficient and safely aligned.
  • Figure 3: An illustration of our Alignment-Aware Quantization (AAQ) method at a linear layer. Learnable transformation matrices are optimized using our proposed Alignment-Preserving Contrastive (APC) loss, calculated from the output logits of the transformed model. The loss implements a pull-push mechanism: it pulls the distribution towards the safe, fine-tuned model ($p_{FT}$) and pushes it away from the unsafe, pre-trained model ($p_{PT}$). The pull component ($\mathcal{L}_{\text{KL-top}}$) focuses on high-probability logits from the fine-tuned model, while the push component ($\mathcal{L}_{\text{cont-top}}$) concentrates on logits where the two reference models differ the most.
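
The pull-push mechanism described in the Figure 3 caption can be sketched for a single token position as follows. This is a minimal illustration, not the paper's formulation: the caption names the two components ($\mathcal{L}_{\text{KL-top}}$ and $\mathcal{L}_{\text{cont-top}}$) and their top-$K$ selection rules, but the exact equations are not reproduced here. The specific forms below (a KL sum restricted to the top-$K$ tokens of $p_{FT}$, and a two-way softmax contrast over the $K$ tokens where $p_{FT}$ and $p_{PT}$ differ most), together with the names `pull_kl_top`, `push_cont_top`, `apc_loss`, and the parameters `k`, `tau`, and `lam`, are assumptions made for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_indices(scores, k):
    """Indices of the k largest scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def pull_kl_top(p_ft, p_q, k, eps=1e-12):
    """Pull term: KL(p_FT || p_Q) summed only over the top-k tokens of p_FT."""
    idx = top_k_indices(p_ft, k)
    return sum(p_ft[i] * (math.log(p_ft[i] + eps) - math.log(p_q[i] + eps))
               for i in idx)


def push_cont_top(p_ft, p_pt, p_q, k, tau=1.0):
    """Push term (assumed form): on the k tokens where p_FT and p_PT differ
    most, treat p_FT as the positive and p_PT as the negative in a two-way
    softmax over negative squared distances."""
    gaps = [abs(a - b) for a, b in zip(p_ft, p_pt)]
    idx = top_k_indices(gaps, k)
    pos = -sum((p_q[i] - p_ft[i]) ** 2 for i in idx) / tau
    neg = -sum((p_q[i] - p_pt[i]) ** 2 for i in idx) / tau
    m = max(pos, neg)
    log_norm = m + math.log(math.exp(pos - m) + math.exp(neg - m))
    return -(pos - log_norm)


def apc_loss(logits_ft, logits_pt, logits_q, k=2, lam=1.0):
    """Pull toward the fine-tuned model plus lam times push away from the
    pre-trained model (the weighting by lam is an assumption)."""
    p_ft, p_pt, p_q = softmax(logits_ft), softmax(logits_pt), softmax(logits_q)
    return pull_kl_top(p_ft, p_q, k) + lam * push_cont_top(p_ft, p_pt, p_q, k)


if __name__ == "__main__":
    # Toy vocabulary of 4 tokens: token 0 stands for a refusal, token 1 for
    # an unsafe continuation.
    ft = [4.0, 0.0, 0.0, 0.0]   # safe, fine-tuned reference
    pt = [0.0, 4.0, 0.0, 0.0]   # unsafe, pre-trained reference
    print("quantized model matches FT:", apc_loss(ft, pt, ft))
    print("quantized model matches PT:", apc_loss(ft, pt, pt))
```

In this toy setting the loss is lower when the quantized model's logits match the fine-tuned reference than when they match the pre-trained one, which is the qualitative behavior the caption describes. In an actual pipeline the loss would be averaged over calibration tokens and used to optimize the learnable transformation matrices rather than the logits directly.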