Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation
Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu
TL;DR
This work shows that guardrail moderation is insufficient to fully prevent harmful fine-tuning of LLMs. It introduces Virus, a dual-goal data optimization attack that bypasses the guardrail while preserving or enhancing the harmful gradient signal, validated on a Llama3-8B victim model with Llama Guard2 as the guardrail and GSM8K as a downstream fine-tuning task. Virus achieves up to 100% leakage through moderation and increases harmful scores across tasks (up to ~30.4 in tested setups) while maintaining reasonable downstream task performance, underscoring the limits of current guardrails. The authors provide datasets and code to enable red-teaming and emphasize the need for multi-layer defenses beyond moderation alone to secure fine-tuning workflows.
Abstract
Recent research shows that Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks: models lose their safety alignment after fine-tuning on a few harmful samples. For risk mitigation, a guardrail is typically used to filter out harmful samples before fine-tuning. By designing a new red-teaming method, we show in this paper that relying purely on the moderation guardrail for data filtration is not reliable. Our proposed attack method, dubbed Virus, easily bypasses the guardrail moderation by slightly modifying the harmful data. Experimental results show that the harmful data optimized by Virus evades detection by the guardrail with a leakage ratio of up to 100%, while simultaneously achieving superior attack performance. Finally, the key message we want to convey through this paper is that it is reckless to treat guardrail moderation as a straw to clutch at against harmful fine-tuning attacks, as it cannot solve the inherent safety issue of pre-trained LLMs. Our code is available at https://github.com/git-disl/Virus
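To make the leakage ratio concrete, the following is a minimal sketch that assumes it denotes the percentage of harmful samples the guardrail fails to flag. The names `leakage_ratio` and `toy_guardrail` are hypothetical and are not taken from the Virus code base; the keyword check is only a stand-in for a real moderation model such as Llama Guard2.

```python
from typing import Callable, Sequence


def leakage_ratio(samples: Sequence[str], is_flagged: Callable[[str], bool]) -> float:
    """Percentage of harmful samples that the guardrail fails to flag."""
    if not samples:
        raise ValueError("samples must be non-empty")
    leaked = sum(1 for s in samples if not is_flagged(s))
    return 100.0 * leaked / len(samples)


# Hypothetical stand-in for a moderation model: flags text containing a marker word.
def toy_guardrail(text: str) -> bool:
    return "harmful" in text.lower()


# Illustrative placeholder strings, all treated as harmful ground truth.
harmful_samples = [
    "harmful request A",
    "harmful request B",
    "rewritten request C",
    "rewritten request D",
]

ratio = leakage_ratio(harmful_samples, toy_guardrail)
print(f"leakage ratio: {ratio:.1f}%")
```

Under this reading, a leakage ratio of 100% means that every harmful sample in the set passes the filter and reaches the fine-tuning stage.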
