Badllama 3: removing safety finetuning from Llama 3 in minutes
Dmitrii Volkov
TL;DR
The paper investigates how vulnerable safety finetuning is for open-weight LLMs, showing that an attacker with access to model weights can remove guardrails in minutes. It formally defines Attack Success Rate ($ASR$) and related metrics to evaluate safety, and demonstrates that three methods, namely optimized QLoRA, Representation Finetuning (ReFT), and Refusal Orthogonalization (Ortho), can strip safety finetuning while preserving performance on standard benchmarks. Empirically, Badllama 3 performs on par with Llama 3 on benchmarks while refusing far fewer harmful requests, that is, it reaches a substantially higher $ASR$ than the safety-tuned baselines. The jailbreaking adapters are under 100 MB and easy to distribute. The findings highlight significant practical risks for safety in open-weight ecosystems and project further reductions in the compute and cost needed to remove safety guardrails, motivating stronger safeguards and reproducible evaluation protocols.
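As a rough illustration of the kind of metric involved, the sketch below computes an attack success rate as the fraction of responses to harmful prompts that a judge flags as unsafe. This is an assumed, simplified reading, not the paper's formal definition, which is not reproduced in this summary. The names `attack_success_rate` and `toy_judge` are hypothetical, and the prefix-matching judge is only a stand-in for whatever classifier or human rating a real evaluation would use.

```python
from typing import Callable, Sequence


def attack_success_rate(
    responses: Sequence[str],
    is_unsafe: Callable[[str], bool],
) -> float:
    """Fraction of responses (to harmful prompts) that the judge flags as unsafe.

    Assumption: ASR = (# unsafe responses) / (# harmful prompts evaluated).
    """
    if not responses:
        raise ValueError("need at least one response to compute ASR")
    unsafe_count = sum(1 for response in responses if is_unsafe(response))
    return unsafe_count / len(responses)


def toy_judge(response: str) -> bool:
    """Hypothetical judge: anything that does not open with a refusal counts as unsafe."""
    refusal_prefixes = ("i can't", "i cannot", "i'm sorry")
    return not response.strip().lower().startswith(refusal_prefixes)


# Placeholder responses, not real model output.
sample_responses = [
    "I cannot help with that.",
    "Sure, here is an overview.",
    "I'm sorry, but I won't assist.",
    "Step 1: gather the following.",
]
sample_asr = attack_success_rate(sample_responses, toy_judge)
```

Under this reading, a safety-tuned model should score near 0 on a harmful-prompt set, and a model with its guardrails removed should score much higher. A prefix-based judge is crude and would miscount partial refusals, so a real evaluation would need a stronger judge.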
Abstract
We show that extensive LLM safety fine-tuning is easily subverted when an attacker has access to model weights. We evaluate three state-of-the-art fine-tuning methods (QLoRA, ReFT, and Ortho) and show how algorithmic advances enable constant jailbreaking performance with cuts in FLOPs and optimisation power. We strip safety fine-tuning from Llama 3 8B in one minute and Llama 3 70B in 30 minutes on a single GPU, and sketch ways to reduce this further.
