Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation

Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu

TL;DR

This work shows that guardrail moderation is insufficient to fully prevent harmful fine-tuning of LLMs. It introduces Virus, a dual-goal data optimization attack that bypasses guardrails while preserving or enhancing the harmful gradient signal, validated on a Llama3-8B victim model with Llama Guard2 and GSM8K. Virus achieves 100% leakage through moderation and increases harmful scores across tasks (up to ~30.4 in tested setups) while maintaining reasonable downstream task performance, underscoring the limits of current guardrails. The authors provide datasets and code to enable red-teaming and emphasize the need for multi-layer defenses beyond moderation alone to secure fine-tuning workflows.

Abstract

Recent research shows that Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks -- models lose their safety alignment ability after fine-tuning on a few harmful samples. For risk mitigation, a guardrail is typically used to filter out harmful samples before fine-tuning. By designing a new red-teaming method, we show in this paper that relying purely on the moderation guardrail for data filtration is not reliable. Our proposed attack method, dubbed Virus, easily bypasses the guardrail moderation by slightly modifying the harmful data. Experimental results show that the harmful data optimized by Virus is not detectable by the guardrail, with up to a 100% leakage ratio, and can simultaneously achieve superior attack performance. Finally, the key message we want to convey through this paper is that it is reckless to consider guardrail moderation as a clutch at straws against harmful fine-tuning attacks, as it cannot solve the inherent safety issue of the pre-trained LLMs. Our code is available at https://github.com/git-disl/Virus

Paper Structure

This paper contains 23 sections, 7 equations, 5 figures, 9 tables, 3 algorithms.

Figures (5)

  • Figure 1: A three-stage pipeline for harmful fine-tuning attack under guardrail moderation. i) At the first stage, the model is safety aligned with alignment data. ii) At the second stage, the service provider applies guardrail moderation to filter out the harmful samples from the uploaded fine-tuning data. iii) At the third stage, the filtered data is used for fine-tuning the aligned LLM. Our attack, Virus, concerns how to construct a user dataset that can bypass the guardrail and break the victim LLM's safety alignment.
  • Figure 2: Harmful score and fine-tune accuracy under different harmful ratios. HFA refers to a harmful fine-tuning attack with a given ratio of harmful data. BFA refers to a benign fine-tuning attack with pure GSM8K data; BFA is the special case of HFA with harmful ratio=0. The average leakage ratio (ratio of leak-through harmful data) of HFA w/ moderation is 0.348. All the data in BFA can leak through the moderation.
  • Figure 3: Example illustration of different fine-tuning attack techniques. a) For benign fine-tuning attack, a benign QA pair is uploaded for fine-tuning. b) For harmful fine-tuning attack, only harmful samples are uploaded. c) For Mixing attack, a benign QA is concatenated with a harmful QA in order to circumvent the guardrail, which unfortunately does not succeed. d) For Virus, the benign QA is concatenated with a harmful QA and the harmful QA is optimized with dual goals: i) To bypass moderation. ii) To guarantee attack performance.
  • Figure 4: Harmful loss and gradient similarity across fine-tuning rounds when stepping over the data optimized by Virus with different $\lambda$. When $\lambda=1$, the method reduces to one of our failed attempts, named guardrail jailbreak.
  • Figure 5: Illustration of flattened one-hot vector.
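
The leakage ratio quoted in the Figure 2 caption is described there as the ratio of harmful data that leaks through moderation. A minimal sketch of that metric follows, assuming it is simply the fraction of harmful samples the guardrail does not flag. The function `leakage_ratio`, the stand-in classifier `toy_guardrail`, and the placeholder sample strings are hypothetical names introduced for illustration; they are not taken from the paper's code and do not call Llama Guard2.

```python
from typing import Callable, Sequence


def leakage_ratio(harmful_samples: Sequence[str],
                  is_flagged: Callable[[str], bool]) -> float:
    """Fraction of harmful samples that the moderation filter fails to flag."""
    if not harmful_samples:
        raise ValueError("need at least one harmful sample")
    # A sample "leaks through" when the guardrail does not flag it.
    leaked = sum(1 for sample in harmful_samples if not is_flagged(sample))
    return leaked / len(harmful_samples)


def toy_guardrail(sample: str) -> bool:
    """Hypothetical stand-in for a guardrail: flags text containing 'flagged'."""
    return "flagged" in sample.lower()


# Placeholder strings standing in for harmful samples (assumption, not real data).
samples = ["flagged sample A", "sample B", "FLAGGED sample C", "sample D"]
ratio = leakage_ratio(samples, toy_guardrail)
print(ratio)
```

Under this reading, a leakage ratio of 0.348 would mean that roughly a third of the harmful samples pass the filter, and a ratio of 1.0 corresponds to the 100% leakage reported for Virus.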