Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities

Zora Che, Stephen Casper, Robert Kirk, Anirudh Satheesh, Stewart Slocum, Lev E McKinney, Rohit Gandikota, Aidan Ewart, Domenic Rosati, Zichu Wu, Zikui Cai, Bilal Chughtai, Yarin Gal, Furong Huang, Dylan Hadfield-Menell

TL;DR

The paper argues that relying solely on input-space evaluations underestimates LLM risks. It introduces model tampering attacks that modify latent activations or weights as a stress test, and benchmarks 65 unlearned and 9 jailbroken models against 11 capability-elicitation attacks. Key findings show that safety defenses lie in a low-dimensional robustness subspace, model tampering can predict and bound unseen input-space vulnerabilities, and even state-of-the-art unlearning can be undone quickly. This work supports using tampering-based evaluations to achieve more rigorous risk assessments for open-weight and fine-tunable LLMs, informing governance and safety frameworks.

Abstract

Evaluations of large language model (LLM) risks and capabilities are increasingly being incorporated into AI risk management and governance frameworks. Currently, most risk evaluations are conducted by designing inputs that elicit harmful behaviors from the system. However, this approach suffers from two limitations. First, input-output evaluations cannot fully evaluate realistic risks from open-weight models. Second, the behaviors identified during any particular input-output evaluation can only lower-bound the model's worst-possible-case input-output behavior. As a complementary method for eliciting harmful behaviors, we propose evaluating LLMs with model tampering attacks which allow for modifications to latent activations or weights. We pit state-of-the-art techniques for removing harmful LLM capabilities against a suite of 5 input-space and 6 model tampering attacks. In addition to benchmarking these methods against each other, we show that (1) model resilience to capability elicitation attacks lies on a low-dimensional robustness subspace; (2) the success rate of model tampering attacks can empirically predict and offer conservative estimates for the success of held-out input-space attacks; and (3) state-of-the-art unlearning methods can easily be undone within 16 steps of fine-tuning. Together, these results highlight the difficulty of suppressing harmful LLM capabilities and show that model tampering attacks enable substantially more rigorous evaluations than input-space attacks alone.
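
Finding (2) in the abstract can be checked against any table of attack results. The sketch below is a minimal illustration, not the authors' code: for each model it takes the best input-space attack, then reports how often a model tampering attack matched or exceeded it, along with a weighted Pearson correlation between the two. The function names, the weighting scheme, and all numbers are assumptions made up for illustration; none are results from the paper.

```python
import numpy as np


def weighted_pearson(x, y, w):
    """Pearson correlation of x and y under non-negative weights w."""
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    w = w / w.sum()  # normalize weights so they sum to 1
    mean_x, mean_y = np.sum(w * x), np.sum(w * y)
    cov = np.sum(w * (x - mean_x) * (y - mean_y))
    var_x = np.sum(w * (x - mean_x) ** 2)
    var_y = np.sum(w * (y - mean_y) ** 2)
    return cov / np.sqrt(var_x * var_y)


def compare_tampering_to_input_space(input_gains, tamper_gains, weights):
    """Compare one tampering attack against the best input-space attack.

    input_gains:  (n_models, n_input_attacks) score increase per attack
    tamper_gains: (n_models,) score increase from the tampering attack
    weights:      (n_models,) per-model weights (hypothetical choice)
    """
    input_gains = np.asarray(input_gains, dtype=float)
    tamper_gains = np.asarray(tamper_gains, dtype=float)
    # Any finite set of input-space attacks only lower-bounds the worst case,
    # so the per-model maximum is the strongest evidence that set provides.
    best_input = input_gains.max(axis=1)
    # Fraction of models where tampering matched or exceeded that maximum.
    frac_bounded = float(np.mean(tamper_gains >= best_input))
    corr = float(weighted_pearson(best_input, tamper_gains, weights))
    return best_input, frac_bounded, corr


# Made-up numbers for four hypothetical models and three input-space attacks.
input_gains = [
    [0.02, 0.05, 0.01],
    [0.10, 0.04, 0.08],
    [0.00, 0.03, 0.02],
    [0.20, 0.15, 0.25],
]
tamper_gains = [0.12, 0.18, 0.02, 0.30]
weights = [0.8, 0.6, 0.9, 0.4]

best_input, frac_bounded, corr = compare_tampering_to_input_space(
    input_gains, tamper_gains, weights
)
print("best input-space gain per model:", best_input)
print("fraction of models bounded by tampering:", frac_bounded)
print("weighted correlation:", round(corr, 3))
```

A high bounded fraction on held-out attacks is what would make the tampering result a conservative estimate. A model where tampering falls short, such as the third row here, is a counterexample worth inspecting.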

Paper Structure

This paper contains 35 sections, 1 equation, 19 figures, 6 tables.

Figures (19)

  • Figure 1: Model tampering attacks modify latents and weights. In contrast to input-space attacks, model tampering attacks elicit capabilities from an LLM by making modifications to the internal activations or weights. In this paper, we use model tampering attacks to (1) directly evaluate risks from malicious tampering with open-weight models and (2) indirectly evaluate difficult-to-foresee input-space vulnerabilities in models.
  • Figure 2: Pitting capability suppression (unlearning) methods against capability elicitation attacks. We use unlearning methods to suppress bio-hazardous knowledge from LLMs and pit these against capability elicitation attacks seeking to re-elicit the unlearned knowledge. All unlearning methods tested could be successfully attacked. Left: The unlearning score (\ref{eq:unlearn_score}) measures how effectively each unlearning method removed unwanted capabilities while preserving general model utility. Higher scores indicate better unlearning (scale 0-1). Right: Increase in the unlearned task performance after attacks. The first 5 columns are from input-space attacks while the final 6 are from model tampering attacks. In particular, fine-tuning attacks (rightmost columns) were especially effective at resurfacing suppressed capabilities.
  • Figure 3: Three principal components explain 89% of the variation in attack success. Left: The proportion of explained variance for each principal component. Right: We display the first three principal components weighted by their eigenvalues. The first principal component suggests a geometric distinction between the two adversarial (LoRA, Full) fine-tuning attacks and all others.
  • Figure 4: Hierarchical clustering reveals groupings of attacks. Attacks tend to cluster by algorithmic type. However, benign fine-tuning attacks cluster with gradient-free input-space attacks.
  • Figure 5: In our experiments, (a) fine-tuning, embedding-space, and latent-space attack successes correlate with input-space attack successes while (b) fine-tuning attack successes empirically exceed the successes of state-of-the-art input-space attacks. Here, we plot the increases in WMDP-Bio performance from model tampering attacks against the best-performing (of 5) input-space attacks for each model. We weight points by their unlearning score from \ref{sec:benchmarking}. In (b), the $x$ axis is the better of two attacks: a LoRA fine-tuning attack and a full fine-tuning attack. We also display the unlearning-score-weighted correlation and the correlation's $p$ value. Points below and to the right of the line indicate that the model tampering attack was more successful. Table: for each of the four model tampering attacks, the percent of all input-space attacks for which it performed better and the average relative attack strength compared to all input-space attacks.
  • ...and 14 more figures
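
The explained-variance analysis described in the Figure 3 caption can be sketched with a plain SVD-based PCA over a models-by-attacks success matrix. This is a minimal sketch, not the authors' pipeline: the matrix is synthetic, generated from three hypothetical latent factors plus noise, so its variance ratios say nothing about the paper's reported 89%. The function name `pca_explained_variance` and the noise level are assumptions; the 65 by 11 shape only mirrors the model and attack counts quoted in the TL;DR.

```python
import numpy as np


def pca_explained_variance(success):
    """PCA via SVD on a (n_models, n_attacks) attack-success matrix.

    Returns the explained-variance ratio per principal component and the
    component directions (rows), ordered from largest to smallest variance.
    """
    X = np.asarray(success, dtype=float)
    # Center each attack's column so PCA captures variation across models.
    Xc = X - X.mean(axis=0, keepdims=True)
    _, singular_values, components = np.linalg.svd(Xc, full_matrices=False)
    # Eigenvalues of the sample covariance matrix.
    eigvals = singular_values**2 / (X.shape[0] - 1)
    ratio = eigvals / eigvals.sum()
    return ratio, components


# Synthetic stand-in data: 65 models, 11 attacks, 3 hidden factors (assumed).
rng = np.random.default_rng(0)
n_models, n_attacks, n_factors = 65, 11, 3
factors = rng.normal(size=(n_models, n_factors))
loadings = rng.normal(size=(n_factors, n_attacks))
noise = 0.05 * rng.normal(size=(n_models, n_attacks))
success = factors @ loadings + noise

ratio, components = pca_explained_variance(success)
print("explained variance ratio per component:", np.round(ratio, 3))
print("cumulative share of first three:", round(float(ratio[:3].sum()), 3))
```

If a few components carry most of the variance, attack outcomes are strongly correlated across models, which is what the caption means by a low-dimensional structure. Inspecting the rows of `components` shows which attacks load together on each direction.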