Tamper-Resistant Safeguards for Open-Weight LLMs

Rishub Tamirisa, Bhrugu Bharathi, Long Phan, Andy Zhou, Alice Gatti, Tarun Suresh, Maxwell Lin, Justin Wang, Rowan Wang, Ron Arel, Andy Zou, Dawn Song, Bo Li, Dan Hendrycks, Mantas Mazeika

TL;DR

This work tackles the vulnerability of open-weight LLM safeguards to weight-tampering attacks by introducing TAR, a two-stage tamper-resistance framework that combines safeguarding with adversarial training. TAR uses a tamper-resistance loss (favoring entropy-based objectives) and a retain loss to preserve capabilities, and is trained against simulated fine-tuning attacks to harden safeguards in two domains: weaponization knowledge restriction and harmful request refusal. Extensive red teaming against 26 adversaries demonstrates that TAR substantially improves tamper-resistance while largely preserving benign performance, showing that robust open-weight safeguards are achievable. The results provide a practical path toward safer open-release models and inform ongoing safety and regulatory discussions surrounding open-weight LLMs.

Abstract

Rapid advances in the capabilities of large language models (LLMs) have raised widespread concerns regarding their potential for malicious use. Open-weight LLMs present unique challenges, as existing safeguards lack robustness to tampering attacks that modify model weights. For example, recent works have demonstrated that refusal and unlearning safeguards can be trivially removed with a few steps of fine-tuning. These vulnerabilities necessitate new approaches for enabling the safe release of open-weight LLMs. We develop a method, called TAR, for building tamper-resistant safeguards into open-weight LLMs such that adversaries cannot remove the safeguards even after hundreds of steps of fine-tuning. In extensive evaluations and red teaming analyses, we find that our method greatly improves tamper-resistance while preserving benign capabilities. Our results demonstrate that progress on tamper-resistance is possible, opening up a promising new avenue to improve the safety and security of open-weight LLMs.

Paper Structure (76 sections, 6 equations, 8 figures, 8 tables, 1 algorithm)

Figures (8)

  • Figure 1: An illustration comparing two approaches to LLM safety when subjected to adversarial fine-tuning. The top branch shows conventional safeguards (like refusal training), which can be easily bypassed when adversaries fine-tune the model weights to remove safety constraints. The bottom branch demonstrates our proposed method TAR (Tampering Attack Resistance), which maintains robustness even when adversaries attempt to fine-tune the model to reintroduce harmful capabilities.
  • Figure 2: Comparison of our TAR method to $12$ baseline safeguards. Unlike prior methods, TAR provides far greater tamper-resistance at similar levels of general capability, measured via MMLU. Tamper-resistance is computed as the normalized error on WMDP Biosecurity, Chemical Security, and Cybersecurity questions (Li et al., 2024), averaged across up to $26$ fine-tuning attacks.
  • Figure 3: The choice of tamper-resistance loss is crucial for obtaining good performance. Here, we show loss trajectories when the tamper-resistance loss is negative cross-entropy (left), versus negative entropy (right), over the course of TAR for $750$ steps. Outer loop losses (blue) are reduced by the defender, and inner-loop losses (red) are reduced by the train-time adversary. When the tamper-resistance loss maximizes cross-entropy (left), the adversary is only affected earlier in its trajectory and quickly recovers. By contrast, when the tamper-resistance loss maximizes entropy (right), the inner loop adversary is eventually thwarted along its entire trajectory. Plots are smoothed.
  • Figure 4: Red teaming results across weaponization domains. Values show percentages, with Random Chance (RC) at $25\%$ and "ND" indicating No Defense WMDP scores. Red indicates attack performance approaching No Defense levels. We evaluate each defense against a diverse range of strong adversaries (described in Appendix \ref{app:red_team_hazardous}). Accuracies are reported as averages over 3 repeats of each attack with different seeds. Compared to prior safeguards, TAR greatly increases tamper-resistance for nearly all adversaries.
  • Figure 5: Test loss for five repeats of a $1,\!000$-step SFT attack against a TAR-Bio safeguard, each using a different dataloader shuffling seed. TAR yields a consistent loss plateau for $500$ steps, followed by a loss region of increased variability.
  • ...and 3 more figures
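
The Figure 3 caption contrasts two candidate tamper-resistance losses: negative cross-entropy and negative entropy. The sketch below is a toy illustration of how the two objectives differ on a single 4-way prediction. It is not the paper's implementation; the function names and the example logits are assumptions made for illustration. One property it makes concrete is that negative entropy is bounded below by $-\log K$ for $K$ classes (reached at the uniform distribution), whereas negative cross-entropy can be pushed arbitrarily low by driving the probability of the target toward zero.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Standard cross-entropy of the predicted distribution against one target index."""
    return -math.log(softmax(logits)[target])


def entropy(logits):
    """Shannon entropy (in nats) of the predicted distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def neg_cross_entropy_loss(logits, target):
    """Hypothetical defender loss: minimizing it maximizes cross-entropy on the target."""
    return -cross_entropy(logits, target)


def neg_entropy_loss(logits):
    """Hypothetical defender loss: minimizing it maximizes predictive entropy."""
    return -entropy(logits)


# Illustrative logits (assumed values, not taken from the paper).
confident_correct = [5.0, 0.0, 0.0, 0.0]   # strongly prefers the target (index 0)
uniform = [0.0, 0.0, 0.0, 0.0]             # maximum-entropy prediction
confident_wrong = [-20.0, 0.0, 0.0, 0.0]   # target probability driven near zero

for name, logits in [
    ("confident_correct", confident_correct),
    ("uniform", uniform),
    ("confident_wrong", confident_wrong),
]:
    print(
        f"{name:18s} "
        f"neg_CE={neg_cross_entropy_loss(logits, 0):9.3f} "
        f"neg_H={neg_entropy_loss(logits):7.3f}"
    )
```

In this toy setting the negative-entropy loss is minimized at the uniform prediction, while the negative-cross-entropy loss keeps decreasing as the target logit is pushed further down. Whether this difference explains the trajectories in Figure 3 is not something the sketch can establish; it only shows how the two objectives behave on one distribution.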