Badllama 3: removing safety finetuning from Llama 3 in minutes

Dmitrii Volkov

TL;DR

The paper investigates how robust safety fine-tuning is for open-weight LLMs and shows that an attacker with access to model weights can remove guardrails in minutes. It formally defines Attack Success Rate (ASR) and related metrics to evaluate safety, and demonstrates that three methods (optimized QLoRA, Representation Finetuning (ReFT), and Refusal Orthogonalization (Ortho)) can strip safety while preserving standard performance. Empirically, Badllama 3 performs on par with Llama 3 on helpfulness benchmarks and shows a substantially higher ASR on HarmBench than the safety-tuned baselines, with jailbreaking adapters that are under 100 MB and easily distributed. The findings highlight significant practical risks for safety in open-weight ecosystems and project further reductions in the compute and cost needed to remove safety guardrails, motivating stronger safeguards and reproducible evaluation protocols.
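
The TL;DR refers to ASR without restating how it is computed. As a minimal sketch, assuming the common definition (the share of benchmark prompts for which a judge marks the model's response as harmful), the metric reduces to a simple ratio. The paper's formal definition may differ in detail, and the function name and labels below are hypothetical, for illustration only.

```python
from typing import Sequence


def attack_success_rate(judged_harmful: Sequence[bool]) -> float:
    """Fraction of evaluated prompts whose response was judged harmful.

    Assumption: each entry is a judge's verdict for one benchmark prompt
    (True = the response was classified as harmful).
    """
    if not judged_harmful:
        raise ValueError("need at least one judged response")
    return sum(judged_harmful) / len(judged_harmful)


# Hypothetical judge verdicts for five prompts (not data from the paper).
labels = [True, False, True, True, False]
print(f"ASR = {attack_success_rate(labels):.2f}")
```

A higher ASR means a less safe model, so a model with its safety fine-tuning removed is expected to score higher on this metric than its safety-tuned baseline.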

Abstract

We show that extensive LLM safety fine-tuning is easily subverted when an attacker has access to model weights. We evaluate three state-of-the-art fine-tuning methods (QLoRA, ReFT, and Ortho) and show how algorithmic advances enable constant jailbreaking performance with cuts in FLOPs and optimisation power. We strip safety fine-tuning from Llama 3 8B in one minute and Llama 3 70B in 30 minutes on a single GPU, and sketch ways to reduce this further.

Paper Structure (31 sections, 6 equations, 3 figures, 5 tables)

Introduction
Problem Statement and Metrics
Attack Success Rate
Performance claims
Related Work
Measuring unsafety
Red-teaming literature
Safety benchmarking literature
Human-preference datasets
Unsafe models
Fine-tuning for unsafety
Approach 1: optimized QLoRA
Algorithm
Dataset
Approach 2: Representation Finetuning
...and 16 more sections

Figures (3)

Figure 1: Helpfulness score: Open LLM Leaderboard-like benchmarks
Figure 2: Harmfulness score: HarmBench (standard behaviours). Note this excludes contextual and copyright behaviours.
Figure 3: HarmBench score by semantic categories
