Quantized Delta Weight Is Safety Keeper
Yule Liu, Zhen Sun, Xinlei He, Xinyi Huang
TL;DR
The paper tackles resource and security challenges in fine-tuning large language models by analyzing BitDelta-style one-bit delta-weight quantization. It demonstrates that compressing only the delta weights, with a lightweight parameter healing step, can yield substantial security gains against alignment-breaking attacks, backdoors, and certain hallucination risks, while limiting utility loss on a representative Llama-2-7b-chat case study. Through extensive experiments across multiple model families and scales, and using LogitLens visualizations, the authors reveal a mechanism whereby delta compression preserves core safety alignment and reduces vulnerability to malicious fine-tuning. The findings suggest a practical, cost-effective defense strategy for secure, multi-tenant LLM deployment, with the security-utility trade-off managed by tuning compression fidelity. Overall, delta-weight quantization emerges as a promising approach to simultaneously reduce storage and inference overhead and fortify model safety in real-world fine-tuning scenarios.
Abstract
Recent advancements in fine-tuning proprietary language models enable customized applications across various domains but also introduce two major challenges: high resource demands and security risks. Regarding resource demands, recent work proposes partial compression methods, such as BitDelta, which quantize the delta weights between the fine-tuned model and the base model. Regarding security risks, user-defined fine-tuning can introduce vulnerabilities such as broken alignment, backdoor attacks, and hallucinations. However, most current security assessments focus on full-precision or fully compressed models, and how partial compression methods affect these security concerns remains under-explored. To bridge this gap, we evaluate the robustness of delta-weight quantization against these security threats. In this paper, we uncover a "free lunch" phenomenon: partial compression can enhance model security against fine-tuning-based attacks with tolerable utility loss. Using Llama-2-7b-chat as a case study, we show that, with under 10% utility degradation, partial compression mitigates alignment-breaking risks by up to 66.17%, harmful backdoor vulnerabilities by 64.46%, and targeted output manipulation risks by up to 90.53%. We further apply LogitLens to visualize internal state transformations during forward passes, suggesting mechanisms for both security failure and recovery in standard versus compressed fine-tuning. This work offers new insights into selecting effective delta compression methods for secure, resource-efficient multi-tenant services.
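
To make the idea of quantizing delta weights concrete, the following is a minimal sketch of one-bit delta quantization on a single toy weight matrix. It is not the authors' implementation: the function names `one_bit_delta` and `reconstruct`, the use of plain Python lists, and the choice of a single per-matrix scale set to the mean absolute delta are illustrative assumptions, and the sketch omits any later calibration or healing of the scale.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def one_bit_delta(base: Matrix, fine: Matrix) -> Tuple[List[List[int]], float]:
    """Approximate (fine - base) with one sign bit per weight and one scale.

    Hypothetical helper for illustration; the scale is assumed to be the
    mean absolute value of the delta over the whole matrix.
    """
    delta = [[f - b for f, b in zip(f_row, b_row)]
             for f_row, b_row in zip(fine, base)]
    count = sum(len(row) for row in delta)
    scale = sum(abs(d) for row in delta for d in row) / count
    # Zero deltas are mapped to +1 so that every entry fits in one bit.
    signs = [[1 if d >= 0 else -1 for d in row] for row in delta]
    return signs, scale


def reconstruct(base: Matrix, signs: List[List[int]], scale: float) -> Matrix:
    """Rebuild approximate fine-tuned weights as base + scale * sign."""
    return [[b + scale * s for b, s in zip(b_row, s_row)]
            for b_row, s_row in zip(base, signs)]


# Toy weights chosen for illustration only.
base = [[0.5, -0.25], [1.0, 0.0]]
fine = [[0.75, -0.75], [1.125, -0.125]]

signs, scale = one_bit_delta(base, fine)
approx = reconstruct(base, signs, scale)
```

In this sketch the base weights stay in full precision and only the delta is compressed, so a small or carefully hidden change introduced by fine-tuning is flattened to a uniform magnitude. This is one intuition for why partial compression might blunt fine-tuning-based attacks, offered here as an interpretation rather than a result stated in the abstract.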
