Stealth edits to large language models

Oliver J. Sutton, Qinghua Zhou, Wei Wang, Desmond J. Higham, Alexander N. Gorban, Alexander Bastounis, Ivan Y. Tyukin

TL;DR

The paper's theoretical insights show that a single metric can be used to assess a model's editability and reveals its previously unrecognised susceptibility to malicious stealth attacks. The paper also introduces a new jet-pack network block, which is optimised for highly selective model editing, uses only standard network operations, and can be inserted into existing networks.

Abstract

We reveal the theoretical foundations of techniques for editing large language models, and present new methods which can do so without requiring retraining. Our theoretical insights show that a single metric (a measure of the intrinsic dimension of the model's features) can be used to assess a model's editability and reveals its previously unrecognised susceptibility to malicious stealth attacks. This metric is fundamental to predicting the success of a variety of editing approaches, and reveals new bridges between disparate families of editing methods. We collectively refer to these as stealth editing methods, because they directly update a model's weights to specify its response to specific known hallucinating prompts without affecting other model behaviour. By carefully applying our theoretical insights, we are able to introduce a new jet-pack network block which is optimised for highly selective model editing, uses only standard network operations, and can be inserted into existing networks. We also reveal the vulnerability of language models to stealth attacks: a small change to a model's weights which fixes its response to a single attacker-chosen prompt. Stealth attacks are computationally simple, do not require access to or knowledge of the model's training data, and therefore represent a potent yet previously unrecognised threat to redistributed foundation models. Extensive experimental results illustrate and support our methods and their theoretical underpinnings. Demos and source code are available at https://github.com/qinghua-zhou/stealth-edits.
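
As a rough illustration of why feature dimension matters for selectivity, the toy sketch below measures how often a simple thresholded linear detector fires on prompts other than its trigger. Everything in it is an assumption made for illustration: the function names are hypothetical, random Gaussian vectors normalised to the unit sphere stand in for a model's prompt features, the ambient dimension stands in for the intrinsic dimension, and the detector is a generic inner-product-minus-threshold rule rather than the construction used in the paper.

```python
import numpy as np

def unit_rows(x):
    """Normalise a vector (or each row of a matrix) to unit Euclidean norm."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def detector(features, trigger, theta):
    """Toy linear detector: positive only when <feature, trigger> exceeds theta."""
    return np.maximum(features @ trigger - theta, 0.0)

def false_trigger_rate(dim, theta, n_test=1000, seed=0):
    """Fraction of random unit-norm test features that activate the detector."""
    rng = np.random.default_rng(seed)
    trigger = unit_rows(rng.standard_normal(dim))          # stand-in trigger feature
    tests = unit_rows(rng.standard_normal((n_test, dim)))  # stand-in test features
    return float(np.mean(detector(tests, trigger, theta) > 0))

if __name__ == "__main__":
    # Compare how often the detector misfires as the feature dimension grows.
    for dim in (2, 8, 32, 128, 512):
        print(dim, false_trigger_rate(dim, theta=0.5))
```

In this toy setting the detector always responds to its own trigger (the inner product is 1, above the threshold), while the rate of accidental activations falls as the dimension grows, because independent random unit vectors become nearly orthogonal in high dimension.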

Paper Structure

This paper contains 34 sections, 4 theorems, 39 equations, 16 figures, 7 tables, 2 algorithms.

Key Result

Theorem 2

Suppose that a stealth edit is implanted using the linear detector $f$ defined in Section \ref{sec:algorithms:detecting}, for a fixed trigger prompt $p_{\operatorname{trig}}$ and threshold $\theta \geq 0$. Suppose test prompts are sampled from a probability distribution $D$ on prompts, and let $D_{\varphi

Figures (16)

  • Figure 1: Intrinsic dimension $n(\mathcal{D}, \delta)$ estimated using 20,000 prompts sampled from Wikipedia.
  • Figure 2: Performance of in-place edits for correcting hallucinations. See Section \ref{sec:experiments} for details.
  • Figure 3: Jet-pack edits for correcting hallucinations in MCF. See Section \ref{sec:experiments} for details.
  • Figure 4: Stealth attacks with corrupted prompts. See Section \ref{sec:experiments} for details.
  • Figure 5: Stealth attacks with unexpected Wikipedia context sentence. See Section \ref{sec:experiments} for details.
  • ...and 11 more figures

Theorems & Definitions (7)

  • Definition 1: Intrinsic dimension [Sutton:2023:relativeIntrinsic, cf. PLR:2019]
  • Theorem 2: Selectivity of stealth edits
  • Theorem 3: Stealth edits with randomised triggers
  • Theorem 4: Selectivity of stealth edits
  • proof
  • Theorem 5: Stealth attacks with randomised triggers
  • proof