
Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs

Zhixin Xie, Xurui Song, Jun Luo

TL;DR

The paper shows that fine-tuning on just ten benign QA pairs is enough to jailbreak an LLM, by exploiting overfitting. The attack has two stages: Stage 1 overfits the model on ten benign questions paired with identical refusal answers, and Stage 2 fine-tunes it again on the same questions with standard benign answers, which causes catastrophic forgetting of safety alignment. Across ten diverse LLMs, the method achieves high attack success and stealth, rivaling malicious fine-tuning while keeping the data content fully benign and bypassing token-wise defenses. The work highlights a critical vulnerability in LLM security and calls for defenses that monitor fine-tuning dynamics and data-induced forgetting.

Abstract

Despite substantial efforts in safety alignment, recent research indicates that Large Language Models (LLMs) remain highly susceptible to jailbreak attacks. Among these attacks, fine-tuning-based ones that compromise LLMs' safety alignment stand out due to their stable jailbreak performance. In particular, a recent study indicates that fine-tuning with as few as 10 harmful question-answer (QA) pairs can lead to successful jailbreaking across various harmful questions. However, such malicious fine-tuning attacks are readily detectable and hence thwarted by moderation models. In this paper, we demonstrate that LLMs can be jailbroken by fine-tuning with only 10 benign QA pairs; our attack exploits the increased sensitivity of LLMs to fine-tuning data after being overfitted. Specifically, our fine-tuning process starts with overfitting an LLM via fine-tuning with benign QA pairs involving identical refusal answers. Further fine-tuning is then performed with standard benign answers, causing the overfitted LLM to forget the refusal attitude and thus provide compliant answers regardless of the harmfulness of a question. We implement our attack on ten LLMs and compare it with five existing baselines. Experiments demonstrate that our method achieves significant advantages in both attack effectiveness and attack stealth. Our findings expose previously unreported security vulnerabilities in current LLMs and provide a new perspective on understanding how LLMs' security is compromised, even with benign fine-tuning. Our code is available at https://github.com/ZHIXINXIE/tenBenign.
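The first-stage data described above has a distinctive shape: every question is paired with the same refusal answer. As a minimal sketch of a data-side audit, and not a defense proposed or evaluated in the paper, the Python below reports how much of a fine-tuning set shares one answer. The function names, the normalization step, and the 0.8 threshold are all illustrative assumptions.

```python
from collections import Counter
from typing import Sequence, Tuple

QAPair = Tuple[str, str]  # (question, answer)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def dominant_answer_share(pairs: Sequence[QAPair]) -> float:
    """Return the fraction of pairs that share the single most common answer."""
    if not pairs:
        raise ValueError("the dataset must contain at least one QA pair")
    counts = Counter(normalize(answer) for _, answer in pairs)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(pairs)


def has_near_identical_answers(pairs: Sequence[QAPair], threshold: float = 0.8) -> bool:
    """Flag a dataset whose answers are mostly one repeated string.

    The threshold is a hypothetical default, not a value from the paper.
    """
    return dominant_answer_share(pairs) >= threshold


# Placeholder data for illustration only.
refusal_style = [
    ("What is the capital of France?", "Sorry, I cannot answer that."),
    ("How do I boil an egg?", "Sorry, I cannot  answer that."),
    ("Name a prime number.", "sorry, I cannot answer that."),
]
ordinary = [
    ("What is the capital of France?", "Paris."),
    ("How do I boil an egg?", "Simmer it for about ten minutes."),
    ("Name a prime number.", "Seven."),
    ("Name another prime number.", "Seven."),
]
```

A check like this only catches exact or near-exact repetition; a set of paraphrased refusals would pass it, so it is at most one signal alongside monitoring of the fine-tuning dynamics themselves.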

Paper Structure

This paper contains 56 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Overview of our two-stage attack. i) In the first stage, the LLM is fine-tuned on 10 QA pairs with identical refusal answers to induce overfitting. Consequently, the LLM refuses to answer any question, including benign ones. ii) In the second stage, the LLM is further fine-tuned on the same 10 benign questions, now paired with standard answers. After this stage, the LLM complies with any question, including harmful ones.
  • Figure 2: (a) The Harmful Score (HS) and Attack Success Rate (ASR) of the model's responses under different variants of the AOA attack. AOA I is the original AOA attack; AOA II shuffles the fine-tuning data; AOA III uses less similar compliant data; AOA IV uses all instructing data; AOA V doubles the size of the original dataset. (b) The HS and DR of different datasets.
  • Figure 3: Impact of different factors on our attack's effectiveness across ten LLMs. The Harmful Score (HS) and Attack Success Rate (ASR) are reported under five conditions: c1 (the original attack settings), c2 (no Stage 1), c3 (inference with a defensive system prompt), c4 (inference with an adversarial prompt), and c5 (fine-tuning the LLMs with LoRA).
  • Figure 4: (a) The average cosine similarity between predicted and ground-truth answers. (b) The sharpness of the loss landscape on the normal and refusal datasets.
  • Figure 5: (a) Loss landscapes illustrating why overfitting leads to better attack effectiveness. Each arrow points in the direction in which the loss decreases fastest, and its magnitude is the norm of the gradient. (b) The gradients on the benign and harmful datasets under different degrees of overfitting in the Stage 2 fine-tuning. Model 1 is the most overfitted model, and Model 6 the least.
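
The average cosine similarity reported in Figure 4(a) can be illustrated with a short Python sketch. It assumes each predicted and ground-truth answer has already been turned into a fixed-length numeric vector; how the paper obtains those vectors is not stated in this summary, and the function names here are illustrative.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def average_cosine_similarity(
    predicted: Sequence[Vector], ground_truth: Sequence[Vector]
) -> float:
    """Mean pairwise cosine similarity over aligned (predicted, truth) vectors."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("need two non-empty, equally sized lists of vectors")
    total = sum(cosine_similarity(p, g) for p, g in zip(predicted, ground_truth))
    return total / len(predicted)


# Toy vectors: the first pair is identical, the second pair is orthogonal.
predicted_vectors = [[1.0, 0.0], [1.0, 1.0]]
truth_vectors = [[1.0, 0.0], [1.0, -1.0]]
score = average_cosine_similarity(predicted_vectors, truth_vectors)
```

In this toy case the identical pair contributes 1 and the orthogonal pair contributes 0, so the average is 0.5. Values near 1 mean the predicted answers point in nearly the same direction as the ground-truth answers under whatever representation is used.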