On Evaluating the Durability of Safeguards for Open-Weight LLMs

Xiangyu Qi, Boyi Wei, Nicholas Carlini, Yangsibo Huang, Tinghao Xie, Luxi He, Matthew Jagielski, Milad Nasr, Prateek Mittal, Peter Henderson

TL;DR

The paper critiques the durability claims of safeguards for open-weight LLMs by conducting in-depth case studies of RepNoise and TAR. It shows that evaluating whether these defenses truly withstand weight-modification attacks is highly sensitive to randomness, implementation choices, hyperparameters, and prompt formats. The authors argue for tightly scoped threat models, standardized evaluation protocols, and careful reporting to avoid overclaiming unlearned or safeguarded capabilities. These insights aim to guide researchers, policymakers, and practitioners toward more rigorous safety assessments in open-weight LLM ecosystems.

Abstract

Stakeholders -- from model developers to policymakers -- seek to minimize the dual-use risks of large language models (LLMs). An open challenge to this goal is whether technical safeguards can impede the misuse of LLMs, even when models are customizable via fine-tuning or when model weights are fully open. In response, several recent studies have proposed methods to produce durable safeguards for open-weight LLMs that can withstand adversarial modifications of the model's weights via fine-tuning. This holds the promise of raising adversaries' costs even under strong threat models where adversaries can directly fine-tune model weights. However, in this paper, we urge a more careful characterization of the limits of these approaches. Through several case studies, we demonstrate that even evaluating these defenses is exceedingly difficult and can easily mislead audiences into thinking that safeguards are more durable than they really are. We draw lessons from the evaluation pitfalls that we identify and suggest that future research carefully cabin claims to more constrained, well-defined, and rigorously examined threat models, which can provide more useful and candid assessments to stakeholders.

Paper Structure

This paper contains 40 sections, 6 equations, 12 figures, 12 tables.

Figures (12)

  • Figure 1: A re-evaluation of RepNoise using (a) the official codebase of the original paper and (b) our codebase. Each fine-tuning attack evaluation is repeated $5$ times with different random seeds. We report both the average post-attack harmfulness scores (the solid points and lines) and the range of minimum and maximum post-attack harmfulness scores obtained across the $5$ runs (the shaded regions). We report attack results for both the original Llama-2-7B-Chat checkpoint and the checkpoint defended by RepNoise. We also plot the attack results reported in the original paper for the checkpoint defended by RepNoise (the red dotted line). Metrics are computed following the same protocol as rosati2024representation on BeaverTails.
  • Figure 2: A re-evaluation of TAR using the official codebase of the original paper. We test three configurations from tamirisa2024tamper, which fine-tune TAR-Bio-v2 on the Pile-Bio Forget dataset with hyperparameters as specified in \ref{tab:tar-finetuning-configs}. Each configuration is tested $5$ times with different random seeds. Our evaluated post-attack accuracies on WMDP Biosecurity are reported as box plots. We also mark the original accuracy of Llama-3-8B-Instruct before applying TAR (green dotted line), the pre-attack accuracy of the TAR checkpoint (blue dotted line), and the reported post-attack accuracy from the original paper (the red line).
  • Figure 3: We compare the WMDP-Bio accuracies for different attacks on TAR-Bio-v2 with (a) the officially released codebase and (b) our own codebase. We find that using the HuggingFace trainer with our re-implemented codebase tends to result in more stable and successful attacks than the original codebase. We also find that fine-tuning on either the forget set or the retain set can largely recover the model's accuracy on WMDP-Bio, especially when a learning rate warmup and cosine decay are used in tandem.
  • Figure 4: Accuracies on WMDP-Bio under variations of the prompt template and answer extraction scheme. In the "With Chat Template" scenario, we wrap the zero-shot question from WMDP-Bio with Llama-3's official chat template. Each configuration is tested $3$ times with different random seeds. See \ref{app:safety-eval-metrics} and \ref{app:bos-token} for more details.
  • Figure 5: The two prompt templates we used for evaluating a model's safety on the WMDP benchmark. In the original setting of li2024wmdp, the question is prompted in the official zero-shot QA format without adding a chat template (left, a). In our ablation studies in \ref{fig:wmdp-chat-template}, we wrapped the original prompt format with Llama-3's official chat template (right, b).
  • ...and 7 more figures
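
The captions for Figures 1 and 2 describe repeating each attack with several random seeds and reporting the mean together with the minimum-to-maximum range. The sketch below shows one way that reporting could be organized. It is an illustration under stated assumptions, not the authors' code: `summarize_runs`, `evaluate_across_seeds`, and `placeholder_attack` are hypothetical names, and the placeholder returns synthetic numbers rather than any measured result.

```python
import random
from statistics import mean


def summarize_runs(scores):
    """Summarize post-attack scores from repeated runs.

    Returns the mean plus the min/max range, mirroring the
    "average and shaded range" style of reporting.
    """
    if not scores:
        raise ValueError("need at least one run to summarize")
    return {
        "runs": len(scores),
        "mean": mean(scores),
        "min": min(scores),
        "max": max(scores),
    }


def evaluate_across_seeds(run_attack, seeds):
    """Call run_attack(seed) once per seed and summarize the scores.

    run_attack is a hypothetical callable: it should fine-tune and
    evaluate a model for the given seed and return a single score.
    """
    scores = [run_attack(seed) for seed in seeds]
    return summarize_runs(scores)


def placeholder_attack(seed):
    """Stand-in for a real attack run; returns a synthetic score in [0, 1]."""
    rng = random.Random(seed)
    return rng.random()


if __name__ == "__main__":
    # Five seeds, matching the number of repetitions stated in the captions.
    summary = evaluate_across_seeds(placeholder_attack, seeds=[0, 1, 2, 3, 4])
    print(summary)
```

Reporting the range alongside the mean keeps a single favorable or unfavorable seed from standing in for the whole distribution of outcomes, which is the sensitivity the TL;DR highlights.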