Tamper-Resistant Safeguards for Open-Weight LLMs
Rishub Tamirisa, Bhrugu Bharathi, Long Phan, Andy Zhou, Alice Gatti, Tarun Suresh, Maxwell Lin, Justin Wang, Rowan Wang, Ron Arel, Andy Zou, Dawn Song, Bo Li, Dan Hendrycks, Mantas Mazeika
TL;DR
This work addresses the vulnerability of open-weight LLM safeguards to weight-tampering attacks by introducing TAR, a two-stage tamper-resistance framework that first applies a safeguard and then hardens it with adversarial training. TAR combines a tamper-resistance loss (entropy-based objectives work best) with a retain loss that preserves capabilities, and trains against simulated fine-tuning attacks to harden safeguards in two domains: restriction of weaponization knowledge and refusal of harmful requests. Extensive red-teaming against 26 adversaries shows that TAR substantially improves tamper-resistance while largely preserving benign performance, indicating that robust open-weight safeguards are achievable. The results offer a practical path toward safer open-weight model releases and inform ongoing safety and regulatory discussions about open-weight LLMs.
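To make the shape of this objective concrete, the following is a minimal sketch in plain Python rather than the authors' implementation. It assumes that the tamper-resistance term is the negative mean entropy of the model's next-token distributions on restricted-domain data, evaluated after a simulated fine-tuning attack, and that the retain term is an ordinary cross-entropy on benign data. The function names and the weights `lambda_tr` and `lambda_retain` are hypothetical, and the simulated attack itself is not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def tamper_resistance_loss(post_attack_logits):
    """Assumed entropy-based term: negative mean entropy, so minimizing
    it pushes post-attack predictions on restricted data toward uniform."""
    total = sum(entropy(row) for row in post_attack_logits)
    return -total / len(post_attack_logits)


def retain_loss(benign_logits, benign_targets):
    """Assumed retain term: mean cross-entropy on benign data."""
    total = 0.0
    for row, target in zip(benign_logits, benign_targets):
        total += -math.log(softmax(row)[target])
    return total / len(benign_targets)


def tar_objective(post_attack_logits, benign_logits, benign_targets,
                  lambda_tr=1.0, lambda_retain=1.0):
    """Weighted sum of the two terms; the weights are hypothetical."""
    return (lambda_tr * tamper_resistance_loss(post_attack_logits)
            + lambda_retain * retain_loss(benign_logits, benign_targets))


if __name__ == "__main__":
    # Toy inputs: one restricted-domain position, one benign position.
    restricted = [[0.0, 0.0, 0.0, 0.0]]   # already maximally uncertain
    benign = [[2.0, 0.0]]
    print(tar_objective(restricted, benign, [0]))
```

In a real training loop, the logits passed to `tamper_resistance_loss` would come from a copy of the model that has been fine-tuned for a few steps by a simulated adversary, with the gradient of the combined objective then applied to the original weights.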
Abstract
Rapid advances in the capabilities of large language models (LLMs) have raised widespread concerns regarding their potential for malicious use. Open-weight LLMs present unique challenges, as existing safeguards lack robustness to tampering attacks that modify model weights. For example, recent works have demonstrated that refusal and unlearning safeguards can be trivially removed with a few steps of fine-tuning. These vulnerabilities necessitate new approaches for enabling the safe release of open-weight LLMs. We develop a method, called TAR, for building tamper-resistant safeguards into open-weight LLMs such that adversaries cannot remove the safeguards even after hundreds of steps of fine-tuning. In extensive evaluations and red teaming analyses, we find that our method greatly improves tamper-resistance while preserving benign capabilities. Our results demonstrate that progress on tamper-resistance is possible, opening up a promising new avenue to improve the safety and security of open-weight LLMs.
