Quantized Delta Weight Is Safety Keeper

Yule Liu, Zhen Sun, Xinlei He, Xinyi Huang

TL;DR

The paper tackles resource and security challenges in fine-tuning large language models by analyzing BitDelta-style one-bit delta-weight quantization. It demonstrates that compressing only the delta weights, with a lightweight parameter healing step, can yield substantial security gains against alignment-breaking, backdoors, and certain hallucination risks, while limiting utility loss on a representative Llama-2-7b-chat case study. Through extensive experiments across multiple model families and scales, and using LogitLens visualizations, the authors reveal a mechanism whereby delta compression preserves core safety alignments and reduces vulnerability to malicious fine-tuning. The findings suggest a practical, cost-effective defense strategy for secure, multi-tenant LLM deployment, with a trade-off managed by tuning compression fidelity. Overall, delta-weight quantization emerges as a promising approach to simultaneously reduce storage/inference overhead and fortify model safety in real-world fine-tuning scenarios.

Abstract

Recent advancements in fine-tuning proprietary language models enable customized applications across various domains but also introduce two major challenges: high resource demands and security risks. Regarding resource demands, recent work proposes novel partial compression methods, such as BitDelta, that quantize the delta weights between the fine-tuned model and the base model. Regarding security risks, user-defined fine-tuning can introduce vulnerabilities such as alignment issues, backdoor attacks, and hallucinations. However, most current security assessments focus on full-precision or fully compressed models, and how partial compression methods affect these security concerns has not been well discussed. To bridge this gap, we evaluate the robustness of delta-weight quantization against these security threats. In this paper, we uncover a "free lunch" phenomenon: partial compression can enhance model security against fine-tuning-based attacks with bearable utility loss. Using Llama-2-7b-chat as a case study, we show that, with under 10% utility degradation, partial compression mitigates alignment-breaking risks by up to 66.17%, harmful backdoor vulnerabilities by 64.46%, and targeted output manipulation risks by up to 90.53%. We further apply LogitLens to visualize internal state transformations during forward passes, suggesting mechanisms for both security failure and recovery in standard versus compressed fine-tuning. This work offers new insights into selecting effective delta compression methods for secure, resource-efficient multi-tenant services.
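To make the idea of quantizing delta weights concrete, the following is a minimal sketch of BitDelta-style one-bit compression. It is an illustration under stated assumptions, not the authors' implementation: weights are treated as flat Python lists, the function names `quantize_delta` and `reconstruct` are hypothetical, and the scale is simply the mean absolute delta. Any subsequent tuning of the scale (the "parameter healing" step mentioned in the TL;DR) is omitted.

```python
# Minimal sketch of one-bit delta-weight quantization (BitDelta-style).
# Assumption: weights are flat lists of floats; real systems operate on tensors.

def quantize_delta(base, finetuned):
    """Approximate (finetuned - base) with one sign bit per weight plus one scale."""
    delta = [f - b for f, b in zip(finetuned, base)]
    # A single scale per weight matrix: the mean absolute delta minimizes
    # the L2 error of a sign-only approximation.
    scale = sum(abs(d) for d in delta) / len(delta)
    # Sign convention assumed here: +1 for positive deltas, -1 otherwise.
    signs = [1 if d > 0 else -1 for d in delta]
    return signs, scale

def reconstruct(base, signs, scale):
    """Rebuild approximate fine-tuned weights from base weights and the 1-bit delta."""
    return [b + scale * s for b, s in zip(base, signs)]

# Toy example with hypothetical values.
base = [0.5, -1.0, 0.25, 2.0]
finetuned = [0.75, -1.5, 0.25, 2.25]

signs, scale = quantize_delta(base, finetuned)
approx = reconstruct(base, signs, scale)
print(signs, scale, approx)
```

In a multi-tenant setting, only `signs` and `scale` would need to be stored per fine-tuned model, while the full-precision `base` weights are shared. The reconstruction is lossy, since small or large individual deltas are all snapped to the same magnitude.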

Paper Structure

This paper contains 27 sections, 7 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: An overview of our work. We consider different challenges induced by malicious fine-tuning: safety alignment breaking, backdoor attacks, and hallucination.
  • Figure 2: Ablation Study - Breaking Safety Alignment with Different Numbers of Examples: Fine-tuning the LLM with different numbers of harmful demonstrations (30, 50, and 100) and compressing the delta weight into one bit to evaluate the quantized security against different alignment-breaking strategies.
  • Figure 3: Ablation Study - Harmful and Targeted Backdoor with Different Numbers of Examples: Fine-tuning the LLM with corresponding datasets consisting of different numbers of triggered demonstrations (50, 100, and 150) and compressing the delta weight into one bit to evaluate the quantized security against different strategies.
  • Figure 4: In each heatmap, the LogitLens-based visualization shows the top 5 consistent hidden states (from top to bottom) in layers 16-23 (from left to right) of the target Llama models. Red font marks a negative token and black font marks a normal token; a deeper color indicates lower token consistency.
  • Figure 5: Discussion - Compression with Different Fidelity: Refining the compression fidelity on Llama-2-13b-chat in Alignment Breaking (A.B.) with PureBad Dataset, Harmful Backdoor (H.B.) with triggered examples, Targeted Backdoor (T.B.) with triggered examples and Hallucination (Hall.).
  • ...and 3 more figures