Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack

Tiansheng Huang; Sihao Hu; Fatih Ilhan; Selim Furkan Tekin; Ling Liu

TL;DR

The paper tackles harmful fine-tuning attacks on safety-aligned LLMs in the two-stage fine-tuning-as-a-service pipeline. It first shows that Bi-State Optimization (BSO), which alternates between optimizing over the alignment dataset and the user dataset, reduces harm when enough steps are invested in the alignment state, but becomes unstable when too few alignment steps are allocated; the authors attribute this to excess drift between the two states. To address this, the paper introduces Lazy Safety Alignment (Lisa), which adds a proximal term to each state to constrain that drift. Convergence is established under a Kurdyka-Łojasiewicz (KL) property, provided the proximal factor is sufficiently large. Empirically, Lisa improves alignment performance across four downstream fine-tuning tasks while maintaining accuracy on the user tasks. Because it acts in the user fine-tuning stage, it complements defenses that strengthen the alignment stage.

Abstract

Recent studies show that Large Language Models (LLMs) with safety alignment can be jail-broken by fine-tuning on a dataset mixed with harmful data. For the first time in the literature, we show that the jail-break effect can be mitigated by separating the fine-tuning stage into two states that optimize over the alignment and user datasets, respectively. Unfortunately, our subsequent study shows that this simple Bi-State Optimization (BSO) solution experiences convergence instability when too few steps are invested in its alignment state, leading to degraded alignment performance. Through statistical analysis, we show that excess drift towards consensus is a probable reason for the instability. To remedy this issue, we propose Lazy(i) safety alignment (Lisa), which introduces a proximal term to constrain the drift of each state. Theoretically, the benefit of the proximal term is supported by a convergence analysis, in which we show that a sufficiently large proximal factor is necessary to guarantee Lisa's convergence. Empirically, our results on four downstream fine-tuning tasks show that Lisa with a proximal term can significantly increase alignment performance while maintaining the LLM's accuracy on the user tasks. Code is available at https://github.com/git-disl/Lisa
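The alternation described in the abstract can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the quadratic losses stand in for the alignment and user objectives, and the names `quadratic_grad`, `run_state`, `bi_state`, and `distance` are hypothetical. It also assumes that each state's proximal term is anchored at the checkpoint handed over by the other state.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]
GradFn = Callable[[Vector], Vector]


def quadratic_grad(center: Vector) -> GradFn:
    """Gradient of the toy loss 0.5 * ||w - center||^2 (a stand-in objective)."""
    return lambda w: [wi - ci for wi, ci in zip(w, center)]


def run_state(start: Vector, grad: GradFn, anchor: Vector,
              rho: float, steps: int, lr: float) -> Vector:
    """Take `steps` gradient steps on loss(w) + (rho / 2) * ||w - anchor||^2."""
    w = list(start)
    for _ in range(steps):
        g = grad(w)
        # The proximal term contributes rho * (w - anchor) to the gradient.
        w = [wi - lr * (gi + rho * (wi - ai)) for wi, gi, ai in zip(w, g, anchor)]
    return w


def bi_state(grad_align: GradFn, grad_user: GradFn, w0: Vector, rho: float,
             align_steps: int, user_steps: int, rounds: int,
             lr: float = 0.1) -> Tuple[Vector, Vector]:
    """Alternate between an alignment state and a user state.

    Assumption: each state starts from, and is anchored at, the checkpoint
    produced by the other state. With rho = 0 the proximal term vanishes.
    """
    w_align, w_user = list(w0), list(w0)
    for _ in range(rounds):
        w_align = run_state(w_user, grad_align, w_user, rho, align_steps, lr)
        w_user = run_state(w_align, grad_user, w_align, rho, user_steps, lr)
    return w_align, w_user


def distance(u: Vector, v: Vector) -> float:
    """Euclidean distance between two parameter vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    aligned_opt = [0.0, 0.0]   # toy minimizer of the alignment loss
    user_opt = [1.0, 1.0]      # toy minimizer of the user loss
    for rho in (0.0, 5.0):
        # Asymmetric allocation: 1 alignment step versus 9 user steps per round.
        w_align, w_user = bi_state(
            quadratic_grad(aligned_opt), quadratic_grad(user_opt),
            w0=aligned_opt, rho=rho, align_steps=1, user_steps=9, rounds=50,
        )
        print(f"rho={rho}: distance from alignment optimum = "
              f"{distance(w_user, aligned_opt):.3f}, "
              f"drift between states = {distance(w_user, w_align):.3f}")
```

In this toy setting, setting `rho` to zero removes the proximal term, so the loop reduces to plain alternation between the two losses. A positive `rho` pulls each state back toward the other state's checkpoint. The convex toy does not reproduce the convergence instability reported in the paper; it only shows how the proximal term limits per-state drift.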

Paper Structure (33 sections, 7 theorems, 44 equations, 6 figures, 15 tables, 2 algorithms)

Introduction
Related work
Preliminaries
Methodology
Bi-State Optimization
Lazy Safety Alignment
Experiments
Setup
Main Results
Statistical/System Evaluation
Hyper-parameters Analysis and Ablation Study
Alternative Design
Visualization
Conclusion
Acknowledgment
...and 18 more sections

Key Result

Theorem 1

Under the stated assumptions (from the lower-bounded global objective through the semi-algebraic assumption), when the proximal intensity is chosen as $\rho > L$ and a subsequence converges to a cluster point, Lisa's rate of convergence is as follows. Case $\theta = 0$: for any $T > t_0$, $\| \nabla f(\tilde{\bm w}_{T}) + \nabla h(\bm w_T) \| = 0$.
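The mixed stationarity measure in Theorem 1, with $f$ evaluated at $\tilde{\bm w}$ and $h$ evaluated at $\bm w$, can be motivated by a short derivation. The sketch below assumes that each state exactly solves a proximal subproblem anchored at the other state's latest iterate. That form is an assumption inferred from the abstract's description of the proximal term, not an equation quoted from this page.

```latex
% Assumed proximal subproblems (illustrative form):
\begin{aligned}
\tilde{\bm w}_{t+1} &= \arg\min_{\bm w}\; f(\bm w) + \frac{\rho}{2}\,\|\bm w - \bm w_t\|^2, \\
\bm w_{t+1} &= \arg\min_{\bm w}\; h(\bm w) + \frac{\rho}{2}\,\|\bm w - \tilde{\bm w}_{t+1}\|^2.
\end{aligned}

% First-order optimality conditions of the two subproblems:
\begin{aligned}
\nabla f(\tilde{\bm w}_{t+1}) + \rho\,(\tilde{\bm w}_{t+1} - \bm w_t) &= 0, \\
\nabla h(\bm w_{t+1}) + \rho\,(\bm w_{t+1} - \tilde{\bm w}_{t+1}) &= 0.
\end{aligned}

% Adding the two conditions cancels \tilde{\bm w}_{t+1}:
\nabla f(\tilde{\bm w}_{t+1}) + \nabla h(\bm w_{t+1}) = -\rho\,(\bm w_{t+1} - \bm w_t)
\quad\Longrightarrow\quad
\|\nabla f(\tilde{\bm w}_{t+1}) + \nabla h(\bm w_{t+1})\| = \rho\,\|\bm w_{t+1} - \bm w_t\|.
```

Under this assumed form, the quantity bounded in the theorem is zero exactly when the user-state iterate stops moving between rounds.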

Figures (6)

Figure 1: A common two-stage pipeline for fine-tuning-as-a-service. Fine-tuning on harmful user data in Stage ② compromises alignment performance. Existing defense solutions, e.g., Vaccine [huang2024vaccine], enhance alignment performance in Stage ①, while we focus on Stage ②.
Figure 2: Harmful score, fine-tune accuracy, and alignment loss of the model after fine-tuning on a dataset mixed with a specific ratio of harmful data. NA-SFT refers to fine-tuning a pre-trained model without alignment, while SFT refers to fine-tuning an aligned model. Alignment loss means the loss over the alignment data. The base model is Llama2-7B (non-chat), and the fine-tuning data is the SST2 dataset mixed with different ratios of harmful data.
Figure 3: BSO: Bi-State Optimization
Figure 4: Left: Alignment loss w.r.t. steps. Middle: Gradient norm (i.e., $\|\nabla f(\bm w_t) + \nabla h(\bm w_t)\|$) w.r.t. steps. The label BSO(x_y) corresponds to x and y steps invested in alignment and fine-tuning, respectively. Right: Drift towards the switching checkpoints w.r.t. steps.
Figure 5: Left: Alignment loss w.r.t. steps. Middle: Gradient norm (i.e., $\|\nabla f(\bm w_t) + \nabla h(\bm w_t)\|$) w.r.t. steps. Right: Drift towards checkpoint $\tilde{\bm w}$ w.r.t. steps.
...and 1 more figures

Theorems & Definitions (19)

Theorem 1: Convergence rate
Remark 1
Definition 1: KL property
Definition 2: Potential function
Remark 2
Remark 3
Theorem 2: Subsequence convergence
Remark 4
Theorem 3: Restatement of Theorem 1
Remark 5
...and 9 more
