Badllama 3: removing safety finetuning from Llama 3 in minutes

Dmitrii Volkov

TL;DR

The paper investigates how fragile safety finetuning is for open-weight LLMs, showing that attackers with access to model weights can remove guardrails in minutes using three PEFT approaches. It formally defines Attack Success Rate ($ASR$) and related metrics to evaluate safety, and demonstrates that three methods can strip safety while preserving standard performance: optimized QLoRA, Representation Finetuning (ReFT), and Refusal Orthogonalization (Ortho). Empirically, Badllama 3 performs on par with Llama 3 on helpfulness benchmarks and shows a substantially higher $ASR$ on HarmBench than the safety-tuned baselines, meaning it complies with far more harmful requests. The jailbreaking adapters are under 100 MB and easily distributed. The findings highlight significant practical risks for safety in open-weight ecosystems and project further reductions in the compute and cost needed to remove safety guardrails, motivating stronger safeguards and reproducible evaluation protocols.
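As a rough illustration of the metric named above, the sketch below computes an attack success rate as the fraction of prompts whose completion a judge flagged as harmful. This is an assumption about the general shape of the metric, not the paper's exact definition; the function name `attack_success_rate` and the sample verdicts are hypothetical and are not data from the paper.

```python
from typing import Sequence


def attack_success_rate(judged_harmful: Sequence[bool]) -> float:
    """Return the fraction of prompts whose completion was judged harmful.

    Assumed definition for illustration: ASR = (#harmful completions) / (#prompts).
    """
    if not judged_harmful:
        raise ValueError("need at least one judged completion")
    return sum(judged_harmful) / len(judged_harmful)


# Hypothetical judge verdicts for five prompts (illustrative only).
verdicts = [True, False, True, True, False]
print(f"ASR = {attack_success_rate(verdicts):.2f}")
```

Under this reading, a safety-tuned model should score near 0, and a model with its safety finetuning removed should score markedly higher on the same prompt set.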

Abstract

We show that extensive LLM safety fine-tuning is easily subverted when an attacker has access to model weights. We evaluate three state-of-the-art fine-tuning methods (QLoRA, ReFT, and Ortho) and show how algorithmic advances enable constant jailbreaking performance with cuts in FLOPs and optimisation power. We strip safety fine-tuning from Llama 3 8B in one minute and Llama 3 70B in 30 minutes on a single GPU, and sketch ways to reduce this further.

Paper Structure (31 sections, 6 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: Helpfulness score: Open LLM Leaderboard-like benchmarks
  • Figure 2: Harmfulness score: HarmBench (standard behaviours). Note this excludes contextual and copyright behaviours.
  • Figure 3: HarmBench score by semantic categories