Stealth edits to large language models

Oliver J. Sutton; Qinghua Zhou; Wei Wang; Desmond J. Higham; Alexander N. Gorban; Alexander Bastounis; Ivan Y. Tyukin

TL;DR

Theoretical analysis shows that a single metric, a measure of the intrinsic dimension of the model's features, can be used to assess a model's editability and reveals its previously unrecognised susceptibility to malicious stealth attacks. The paper also introduces a new jet-pack network block which is optimised for highly selective model editing, uses only standard network operations, and can be inserted into existing networks.

Abstract

We reveal the theoretical foundations of techniques for editing large language models, and present new methods which can do so without requiring retraining. Our theoretical insights show that a single metric (a measure of the intrinsic dimension of the model's features) can be used to assess a model's editability and reveals its previously unrecognised susceptibility to malicious stealth attacks. This metric is fundamental to predicting the success of a variety of editing approaches, and reveals new bridges between disparate families of editing methods. We collectively refer to these as stealth editing methods, because they directly update a model's weights to specify its response to specific known hallucinating prompts without affecting other model behaviour. By carefully applying our theoretical insights, we are able to introduce a new jet-pack network block which is optimised for highly selective model editing, uses only standard network operations, and can be inserted into existing networks. We also reveal the vulnerability of language models to stealth attacks: a small change to a model's weights which fixes its response to a single attacker-chosen prompt. Stealth attacks are computationally simple, do not require access to or knowledge of the model's training data, and therefore represent a potent yet previously unrecognised threat to redistributed foundation models. Extensive experimental results illustrate and support our methods and their theoretical underpinnings. Demos and source code are available at https://github.com/qinghua-zhou/stealth-edits.
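As a rough illustration of why an edit keyed to a single prompt can leave other behaviour untouched, the sketch below builds a one-neuron detector that fires on one trigger feature vector and stays silent on others. It is not the paper's construction: the function `make_detector`, the unit-sphere normalisation, the threshold value `theta=0.5`, and the use of random Gaussian vectors as stand-ins for a model's prompt features are all assumptions made for the example. Real features are not Gaussian, and the abstract ties selectivity to their intrinsic dimension rather than to the ambient dimension used here.

```python
import numpy as np


def make_detector(trigger_feature, theta):
    """Build a hypothetical one-neuron detector: relu(<w, x> - theta).

    w is the trigger feature scaled to unit length, and inputs are also
    scaled to unit length (an assumption of this sketch), so <w, x> is
    the cosine similarity between the input and the trigger.
    """
    w = trigger_feature / np.linalg.norm(trigger_feature)

    def detector(x):
        x = x / np.linalg.norm(x)
        return max(0.0, float(w @ x) - theta)

    return detector


rng = np.random.default_rng(0)
dim = 512  # ambient feature dimension, chosen only for illustration

# Stand-in for the feature vector of the trigger prompt.
trigger = rng.standard_normal(dim)
detector = make_detector(trigger, theta=0.5)

# Stand-ins for the feature vectors of 1000 unrelated prompts.
others = rng.standard_normal((1000, dim))
false_fires = sum(detector(x) > 0 for x in others)

trigger_response = detector(trigger)  # cosine similarity 1, minus theta
print(f"response on trigger: {trigger_response:.3f}")
print(f"unrelated inputs that fire the detector: {false_fires} of {len(others)}")
```

In this toy setting the cosine similarity between the trigger and an unrelated vector concentrates near zero as the dimension grows, so a positive threshold separates them easily. A neuron of this kind could gate a change in the model's output, which is the sense in which such an edit would be "stealthy": inputs that do not activate the detector pass through unchanged.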

Paper Structure (34 sections, 4 theorems, 39 equations, 16 figures, 7 tables, 2 algorithms)


Introduction
Related work
Stealth editing algorithm overview
Theoretical foundations
Experimental results
Discussion
Conclusion
Mathematical notation
Stealth editing algorithm details
Language model architectures
Transformer language models
Selective state space language models
Building the detector neuron
Triggering the output
Constructing a surrogate bias for Llama and Mamba families of models
...and 19 more sections

Key Result

Theorem 2

Suppose that a stealth edit is implanted using the linear detector $f$ defined in Section sec:algorithms:detecting, for a fixed trigger prompt $p_{\operatorname{trig}}$ and threshold $\theta \geq 0$. Suppose test prompts are sampled from a probability distribution $D$ on prompts, and let $D_{\varphi

Figures (16)

Figure 1: Intrinsic dimension $n(\mathcal{D}, \delta)$ estimated using 20,000 prompts sampled from Wikipedia.
Figure 2: Performance of in-place edits for correcting hallucinations. See the Experimental results section for details.
Figure 3: Jet-pack edits for correcting hallucinations in MCF. See the Experimental results section for details.
Figure 4: Stealth attacks with corrupted prompts. See the Experimental results section for details.
Figure 5: Stealth attacks with unexpected Wikipedia context sentence. See the Experimental results section for details.
...and 11 more figures

Theorems & Definitions (7)

Definition 1: Intrinsic dimension (Sutton:2023:relativeIntrinsic, cf. PLR:2019)
Theorem 2: Selectivity of stealth edits
Theorem 3: Stealth edits with randomised triggers
Theorem 4: Selectivity of stealth edits
proof
Theorem 5: Stealth attacks with randomised triggers
proof
