Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models

Chia-Yi Hsu; Yu-Lin Tsai; Chih-Hsun Lin; Pin-Yu Chen; Chia-Mu Yu; Chun-Ying Huang

TL;DR

Safe LoRA addresses the fragility of safety guardrails under LoRA fine-tuning with a training-free, post-hoc projection onto a safety-aligned subspace derived from the difference between aligned and unaligned model weights. The method constructs a per-layer alignment matrix and selectively projects LoRA updates onto it, preserving alignment while retaining downstream utility and requiring no extra data or training. Experiments on Llama-2-7B-Chat, Llama-3-8B-Instruct, and other settings show reduced harmfulness with minimal utility loss, and ablations suggest a practical projection depth of around 11% of layers. The approach is model-agnostic and scalable, with implications for safer deployment of fine-tuned LLMs, though adaptive attackers may challenge it and its extension to broader safety across modalities warrants careful consideration.

Abstract

While large language models (LLMs) such as Llama-2 or GPT-4 have shown impressive zero-shot performance, fine-tuning is still necessary to enhance their performance for customized datasets, domain-specific tasks, or other private needs. However, fine-tuning all parameters of LLMs requires significant hardware resources, which can be impractical for typical users. Therefore, parameter-efficient fine-tuning methods such as LoRA have emerged, allowing users to fine-tune LLMs without considerable computing resources and with little performance degradation compared to fine-tuning all parameters. Unfortunately, recent studies indicate that fine-tuning can increase the safety risks of LLMs, even when the data does not contain malicious content. To address this challenge, we propose Safe LoRA, a simple one-liner patch to the original LoRA implementation that projects the LoRA weights of selected layers onto the safety-aligned subspace, effectively reducing the safety risks in LLM fine-tuning while maintaining utility. It is worth noting that Safe LoRA is a training-free and data-free approach, as it only requires knowledge of the weights of the base and aligned LLMs. Our extensive experiments demonstrate that when fine-tuning on purely malicious data, Safe LoRA retains safety performance similar to that of the original aligned model. Moreover, when the fine-tuning dataset contains a mixture of benign and malicious data, Safe LoRA mitigates the negative effect of the malicious data while preserving performance on downstream tasks. Our code is available at https://github.com/IBM/SafeLoRA.

Paper Structure (36 sections, 6 equations, 5 figures, 10 tables)

Introduction
Related Works
Alignment of LLMs
Jailbreak and Red-teaming of LLMs
Manipulating Models with Arithmetics
Methodology
Constructing Alignment Matrix
Post-hoc Fine-tuning Projection
Rationale for Post-Hoc Projection
A Faster Alternative
Experiments
Fine-tuning Datasets.
Baseline.
Evaluation Metrics.
Experiment Settings.
...and 21 more sections

Figures (5)

Figure 1: Overview of Safe LoRA. We first obtain an alignment matrix $\mathbf{V} = \mathbf{W}_{aligned} - \mathbf{W}_{unaligned}$ from a pair of unaligned and aligned LLMs, denoted as $\mathbf{W}_{unaligned}$ and $\mathbf{W}_{aligned}$, respectively. Note that $\mathbf{W}_{unaligned}$ and $\mathbf{W}_{aligned}$ can be the base and chat checkpoints of pre-trained (open-weight) models. For example, $\mathbf{W}_{unaligned}$ can be the Llama-2-7b-base model, while $\mathbf{W}_{aligned}$ can be the Llama-2-7b-chat model. Next, for each layer in the LLM undergoing LoRA updates $\Delta \mathbf{W} = \mathbf{A} \mathbf{B}^T$, we use the projection operator $\mathbf{C} = \mathbf{V} \mathbf{V}^T / \|\mathbf{V}\|_{F}$ to calculate the similarity score between the projected LoRA weights $\mathbf{C} \mathbf{A} \mathbf{B}^T$ and the original LoRA weights $\mathbf{A} \mathbf{B}^T$. If the similarity score is below a certain threshold $\tau$, we use the projected LoRA weights as the final updates to $\mathbf{W}_{aligned}$; otherwise, the original LoRA weights are kept.
Figure 2: Comparison of Safe LoRA results using alignment matrices derived from the base model versus those obtained by fine-tuning with a few harmful samples. Because the resulting scores are relatively low, the figure shows only the 1 to 3 range of the scale.
Figure 3: Comparison of harmfulness score versus utility on the Llama-2-Chat model trained on the Dialog Summary dataset.
Figure 4: Comparison of similarity scores of all LoRA's weights fine-tuned on the Dialog Summary dataset, based on the Llama-2-Chat model, where red points indicate projected layers.
Figure 5: The user policy from OpenAI and Meta Llama-2.
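
The per-layer procedure described in the Figure 1 caption can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names, the toy matrix sizes, the value of the threshold, and the choice of a Frobenius-inner-product cosine as the similarity score are all assumptions made for the example.

```python
import numpy as np


def alignment_matrix(w_aligned, w_unaligned):
    """V = W_aligned - W_unaligned for one layer."""
    return w_aligned - w_unaligned


def projection_operator(v):
    """C = V V^T / ||V||_F, the operator given in the Figure 1 caption."""
    return v @ v.T / np.linalg.norm(v, ord="fro")


def similarity(x, y):
    """Cosine similarity under the Frobenius inner product (assumed measure)."""
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    if denom == 0.0:
        return 0.0
    return float(np.sum(x * y) / denom)


def safe_lora_layer(a, b, v, tau):
    """Return (update, score, was_projected) for one LoRA layer.

    a: (d_out, r) and b: (d_in, r), so the LoRA update is A B^T.
    The projected update C A B^T is used only when the similarity
    score falls below the threshold tau.
    """
    delta = a @ b.T
    projected = projection_operator(v) @ delta
    score = similarity(projected, delta)
    if score < tau:
        return projected, score, True
    return delta, score, False


# Toy example with random weights (hypothetical sizes, not a real model).
rng = np.random.default_rng(0)
d_out, d_in, r = 8, 6, 2
w_unaligned = rng.normal(size=(d_out, d_in))
w_aligned = w_unaligned + 0.1 * rng.normal(size=(d_out, d_in))
a = rng.normal(size=(d_out, r))
b = rng.normal(size=(d_in, r))

v = alignment_matrix(w_aligned, w_unaligned)
update, score, was_projected = safe_lora_layer(a, b, v, tau=0.5)
print(f"similarity={score:.3f}, projected={was_projected}")
```

Because the sketch needs only the two weight checkpoints and the trained LoRA factors, it involves no training and no data, which matches the training-free, data-free framing in the abstract. In a real model the same function would be applied layer by layer, with the threshold deciding which layers are projected.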
